import pandas as pd
data = pd.read_csv('toy_dataset.csv', index_col=0)
# index_col=0: without it, pandas adds an extra column named 'Unnamed: 0' containing User 1, ...
data
Overview of data
The dataset holds each user's ratings for the movies action1, action2, action3, romantic2 and romantic3. User 1 hasn't watched the romantic1 movie, so that rating is missing, which is the realistic case!
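The toy_dataset.csv file itself is not shown here. If you want to reproduce the notebook, a stand-in with the structure described above can be built like this (the movie names come from the text, but the particular ratings are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for toy_dataset.csv: rows are users, columns are
# movies, ratings run 0-5, and NaN marks a movie the user hasn't watched.
# These particular numbers are made up for illustration.
toy = pd.DataFrame(
    {
        "action1":   [5, 4, np.nan, 1],
        "action2":   [4, 5, 2, np.nan],
        "action3":   [np.nan, 4, 1, 2],
        "romantic1": [np.nan, 1, 5, 4],
        "romantic2": [1, np.nan, 4, 5],
        "romantic3": [2, 1, 5, np.nan],
    },
    index=["User 1", "User 2", "User 3", "User 4"],
)
toy.to_csv("toy_dataset.csv")  # lets read_csv('toy_dataset.csv', index_col=0) work
```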
Methods
There are two methods in Collaborative Filtering.
Item-to-Item Collaborative Filtering is the more commonly used of the two, and generally the more accurate as well!
Here we simply fill the NaN values in each column with that column's mean. There are many other ways to deal with NaN values; this is just one of them.
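For reference, a few of those other ways of handling NaN values can be sketched like this (df here is a small made-up frame, not the notebook's dataset):

```python
import pandas as pd

# A small made-up frame with missing ratings:
df = pd.DataFrame({"a": [1.0, None, 3.0], "b": [4.0, 5.0, None]})

dropped  = df.dropna()                      # discard incomplete rows entirely
constant = df.fillna(0)                     # treat "not watched" as rating 0
row_mean = df.T.fillna(df.mean(axis=1)).T   # fill with each user's own mean
```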
data.mean()
for i in data.columns:
    # assign back instead of fillna(..., inplace=True) on a column slice,
    # which modern pandas deprecates
    data[i] = data[i].fillna(round(data[i].mean()))
data
A Rough Example for Better Understanding
a=pd.DataFrame([[1,2],[3,4]])
a
a.sum()
The default is axis = 0, which means the elements are summed column-wise!
a.sum(axis=1)
Our Defined Method
def standardized(col):
    # note: dividing by the range (max - min) rather than the standard
    # deviation makes this mean normalization, not textbook standardization
    new_col = (col - col.mean()) / (col.max() - col.min())
    return new_col
standardized(a)
Min Max Normalization
By default, the values are transformed into the range [0, 1].
from sklearn.preprocessing import MinMaxScaler
s_1 = MinMaxScaler()
s_1.fit_transform(a)
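As a sanity check (a sketch, not part of the original notebook): the transform MinMaxScaler applies to each column is (x - min) / (max - min), which we can compute by hand and compare.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

a = np.array([[1.0, 2.0], [3.0, 4.0]])

# Manual min-max normalization, column by column:
manual = (a - a.min(axis=0)) / (a.max(axis=0) - a.min(axis=0))
scaled = MinMaxScaler().fit_transform(a)

assert np.allclose(manual, scaled)  # both map each column onto [0, 1]
```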
Standardization
from sklearn.preprocessing import StandardScaler
s_2 = StandardScaler()
s_2.fit_transform(a)
Final Decision
Our defined method and Standardization both give suitable outcomes. For the final work we will go with Standardization.
scaler = StandardScaler()
data_std_numpy = scaler.fit_transform(data)
data_std_numpy
data_std_df = pd.DataFrame(data_std_numpy, columns = data.columns, index = data.index)
data_std_df
There are two methods to find similarities between two movies
1. Cosine Distance
Since we are doing Item-to-Item collaborative filtering, we need to find similarities between the movies. Thus I have to take the transpose of the current dataframe and apply cosine_similarity to it.
data_std_df_T = data_std_df.T
data_std_df_T
from sklearn.metrics.pairwise import cosine_similarity
similar_matrix_numpy = cosine_similarity(data_std_df_T)
similar_matrix_df = pd.DataFrame(similar_matrix_numpy, columns = data.columns, index = data.columns)
similar_matrix_df
2. Pearson Correlation
Here you don't need to transpose the main matrix. On standardized (mean-centered) data, Pearson correlation is equivalent to cosine similarity, so both approaches give the same result!
corrMatrix = data_std_df.corr(method='pearson')
corrMatrix
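This equivalence can be verified numerically on a small random frame (a sketch, not part of the original notebook): Pearson correlation between columns equals the cosine similarity of the mean-centered columns.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(6, 3)), columns=["x", "y", "z"])

centered = df - df.mean()                    # mean-center each column
cos = cosine_similarity(centered.T)          # cosine between the columns
corr = df.corr(method="pearson").to_numpy()  # Pearson correlation

assert np.allclose(cos, corr)  # identical up to floating-point error
```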
So we can use either one for our final model! Let us use the corrMatrix for our final prediction.
Keep In Mind
In the correlation matrix, a positive value near 1 means the two movies are strongly related, while a negative value near -1 means they are strongly related in the opposite sense. In the usual approach we would take the abs() of the correlation matrix, so that positive and negative relations are treated as equally similar, like the following:
corrMatrix_temp = data_std_df.corr(method='pearson').abs()
corrMatrix_temp
action1 action2 action3 romantic1 romantic2 romantic3
action1 1.000000 0.280671 0.534522 0.634745 0.798762 0.684653
action2 0.280671 1.000000 0.763763 0.327327 0.372678 0.745356
action3 0.534522 0.763763 1.000000 0.785714 0.487950 0.487950
romantic1 0.634745 0.327327 0.785714 1.000000 0.731925 0.243975
romantic2 0.798762 0.372678 0.487950 0.731925 1.000000 0.666667
romantic3 0.684653 0.745356 0.487950 0.243975 0.666667 1.000000
But here that would be a problem! Let me describe it; see the following image. A negative correlation acts like a dissimilarity (in the spirit of Euclidean distance): the two movies are rated in opposite ways, so treating it as a similarity gives a dissimilar output. A positive correlation, on the other hand, behaves like the angular distance between the rating vectors.
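A tiny made-up example of why the sign matters: two movies rated in exactly opposite ways have a correlation near -1, and taking abs() would report them as perfectly similar.

```python
import pandas as pd

# Two movies rated in exactly opposite ways by four users (made-up ratings):
ratings = pd.DataFrame({"action_x":  [5, 4, 1, 0],
                        "romance_y": [0, 1, 4, 5]})

corr = ratings.corr().loc["action_x", "romance_y"]
print(corr)       # close to -1: a strong *negative* relation
print(abs(corr))  # abs() would call them strongly similar, which is misleading
```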
Conclusion
For the above reason we keep the correlations as they are, interpreting them as angular distances in which negative values mean dissimilar. Thus we do not apply abs() to the correlation matrix.
number_of_total_recommended_movies=6
highest_rating, lowest_rating=5, 0
mean_of_rating = (highest_rating - lowest_rating)/2
def get_similar_movies(movie_name, user_rating):
    similar_score = corrMatrix[movie_name] * (user_rating - mean_of_rating)
    similar_score = similar_score.sort_values(ascending=False)
    return similar_score
Here, (user_rating - mean_of_rating) is the key step. Without it, you can't get the right recommendations when a user rates a movie 1 or 2, because low ratings must contribute negative weights!
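A quick sketch of how the weighting behaves, using made-up correlation values (corr_with_action1 here is illustrative, not the notebook's actual corrMatrix):

```python
import pandas as pd

mean_of_rating = 2.5  # midpoint of the 0-5 rating scale, as defined above
# Made-up correlations of a few movies with action1, for illustration only:
corr_with_action1 = pd.Series({"action1": 1.0, "action2": 0.6, "romantic1": -0.4})

# Rating 5 -> weight +2.5: positively correlated movies come out on top.
loved = (corr_with_action1 * (5 - mean_of_rating)).sort_values(ascending=False)

# Rating 1 -> weight -1.5: the sign flips, so the negatively correlated
# romantic1 rises to the top instead.
hated = (corr_with_action1 * (1 - mean_of_rating)).sort_values(ascending=False)
```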
print(get_similar_movies('action1', 5))
print(get_similar_movies('romantic1', 1))
def if_user_rates_multiple_movies(movies_history):
    # DataFrame.append was removed in pandas 2.0, so build the frame from a
    # list of score Series instead (one row per rated movie)
    scores = [get_similar_movies(movie, rating) for movie, rating in movies_history]
    return pd.DataFrame(scores)
user_1 = [['action1', 5], ['romantic1', 2], ['romantic3', 1]]
final_df = if_user_rates_multiple_movies(user_1)
final_df
final_df.sum().sort_values(ascending=False)
Final Step
The final step is to not recommend the movies a user has already watched!
def get_fresh_recommendation(final_similarity_df, previous_history):
    # movies the user has already rated
    user_saw = [movie for movie, rating in previous_history]
    # keep only the rows for movies outside the user's history
    # (DataFrame.append, used originally, was removed in pandas 2.0)
    return final_similarity_df[~final_similarity_df.index.isin(user_saw)]
all_sum = pd.DataFrame(final_df.sum().sort_values(ascending=False), columns=['similarity'])
all_sum
get_fresh_recommendation(all_sum, user_1)