Collaborative Filtering

In [73]:
import pandas as pd

data = pd.read_csv('toy_dataset.csv', index_col=0)
# index_col=0: without it pandas adds an extra 'Unnamed: 0' column containing 'user 1', 'user 2', ...

data
Out[73]:
action1 action2 action3 romantic1 romantic2 romantic3
user 1 4.0 5.0 3.0 NaN 2.0 1.0
user 2 5.0 3.0 3.0 2.0 2.0 NaN
user 3 1.0 NaN NaN 4.0 5.0 4.0
user 4 NaN 2.0 1.0 4.0 NaN 3.0
user 5 1.0 NaN 2.0 3.0 3.0 4.0

Overview of data

Each row holds one user's ratings: user 1 has rated the action1, action2, action3, romantic2 and romantic3 movies. User 1 hasn't watched romantic1, so that cell is NaN, which mirrors real data where most users rate only a few items.
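If toy_dataset.csv isn't at hand, the same DataFrame can be rebuilt directly from the table above (the values are copied from Out[73]; only the construction code is new):

```python
import numpy as np
import pandas as pd

# Rebuild the toy ratings table; np.nan marks movies a user hasn't watched
data = pd.DataFrame(
    {
        'action1':   [4.0, 5.0, 1.0, np.nan, 1.0],
        'action2':   [5.0, 3.0, np.nan, 2.0, np.nan],
        'action3':   [3.0, 3.0, np.nan, 1.0, 2.0],
        'romantic1': [np.nan, 2.0, 4.0, 4.0, 3.0],
        'romantic2': [2.0, 2.0, 5.0, np.nan, 3.0],
        'romantic3': [1.0, np.nan, 4.0, 3.0, 4.0],
    },
    index=['user 1', 'user 2', 'user 3', 'user 4', 'user 5'],
)
```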

Methods

There are 2 methods in Collaborative Filtering:

  • User-to-User Collaborative Filtering
  • Item-to-Item Collaborative Filtering

Item-to-Item Collaborative Filtering is the more commonly used of the two and generally gives more accurate results.

Item to Item Method

Here, we simply fill each NaN with the mean of its column. There are many other ways to handle missing values; this is just one of them.

In [16]:
data.mean()
Out[16]:
action1      2.750000
action2      3.333333
action3      2.250000
romantic1    3.250000
romantic2    3.000000
romantic3    3.000000
dtype: float64
In [22]:
# Fill each column's NaN with that column's rounded mean
# (assigning back avoids the deprecated inplace fillna on a column)
for i in data.columns:
    data[i] = data[i].fillna(round(data[i].mean()))
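The same fill can also be done without a loop; here is an alternative sketch on a small stand-in frame (column names 'a' and 'b' are made up):

```python
import numpy as np
import pandas as pd

# Stand-in for the ratings table; 'a' and 'b' are hypothetical columns
df = pd.DataFrame({'a': [4.0, np.nan, 2.0], 'b': [np.nan, 2.0, 4.0]})

# df.mean() is a per-column Series; fillna aligns it to columns by name,
# so this fills every column's NaN with that column's rounded mean at once
filled = df.fillna(df.mean().round())
```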
In [23]:
data
Out[23]:
action1 action2 action3 romantic1 romantic2 romantic3
user 1 4.0 5.0 3.0 3.0 2.0 1.0
user 2 5.0 3.0 3.0 2.0 2.0 3.0
user 3 1.0 3.0 2.0 4.0 5.0 4.0
user 4 3.0 2.0 1.0 4.0 3.0 3.0
user 5 1.0 3.0 2.0 3.0 3.0 4.0

Rough For Better Understanding

In [58]:
a=pd.DataFrame([[1,2],[3,4]])
a
Out[58]:
0 1
0 1 2
1 3 4
In [48]:
a.sum()
Out[48]:
0    4
1    6
dtype: int64

The default is axis=0, which means the elements are summed column-wise.

In [47]:
a.sum(axis=1)
Out[47]:
0    3
1    7
dtype: int64

Our Defined Method

standardized = (X - X.mean()) / (X.max() - X.min())
In [41]:
def standardized(col):
    new_col = (col-col.mean())/(col.max()-col.min())
    return new_col
standardized(a)
Out[41]:
0 1
0 -0.5 -0.5
1 0.5 0.5

Min Max Normalization

MinMaxNormalization = (X - X.min()) / (X.max() - X.min())

By default, the values are transformed into the range [0, 1].
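The same transform can be written out by hand with plain NumPy (using the 2x2 matrix a from the rough-work section), which makes the formula above concrete:

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])

# Column-wise min-max scaling: (X - X.min()) / (X.max() - X.min())
mn, mx = a.min(axis=0), a.max(axis=0)
scaled = (a - mn) / (mx - mn)
```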

In [61]:
from sklearn.preprocessing import MinMaxScaler
s_1 = MinMaxScaler()
s_1.fit_transform(a)
Out[61]:
array([[0., 0.],
       [1., 1.]])

Standardization

Standardization = (X - μ) / σ
In [50]:
from sklearn.preprocessing import StandardScaler
s_2 = StandardScaler()
s_2.fit_transform(a)
Out[50]:
array([[-1., -1.],
       [ 1.,  1.]])
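As a quick sanity check without sklearn, StandardScaler's result on a can be reproduced manually; note that it uses the population standard deviation (ddof=0):

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])

# (X - mean) / std, column-wise; np.std defaults to ddof=0 like StandardScaler
z = (a - a.mean(axis=0)) / a.std(axis=0)
```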

Final Decision

Both our defined method and Standardization give suitable outcomes. For the final work we will go with Standardization.

In [69]:
scaler = StandardScaler()
data_std_numpy = scaler.fit_transform(data)
data_std_numpy
Out[69]:
array([[ 0.75      ,  1.83711731,  1.06904497, -0.26726124, -0.91287093,
        -1.82574186],
       [ 1.375     , -0.20412415,  1.06904497, -1.60356745, -0.91287093,
         0.        ],
       [-1.125     , -0.20412415, -0.26726124,  1.06904497,  1.82574186,
         0.91287093],
       [ 0.125     , -1.22474487, -1.60356745,  1.06904497,  0.        ,
         0.        ],
       [-1.125     , -0.20412415, -0.26726124, -0.26726124,  0.        ,
         0.91287093]])
In [72]:
data_std_df = pd.DataFrame(data_std_numpy, columns = data.columns, index = data.index)
data_std_df
Out[72]:
action1 action2 action3 romantic1 romantic2 romantic3
user 1 0.750 1.837117 1.069045 -0.267261 -0.912871 -1.825742
user 2 1.375 -0.204124 1.069045 -1.603567 -0.912871 0.000000
user 3 -1.125 -0.204124 -0.267261 1.069045 1.825742 0.912871
user 4 0.125 -1.224745 -1.603567 1.069045 0.000000 0.000000
user 5 -1.125 -0.204124 -0.267261 -0.267261 0.000000 0.912871

There are 2 methods to find similarities between two movies:

    1. Cosine Distance
    2. Pearson Correlation

1. Cosine Distance

Since we are doing Item-to-Item collaborative filtering, we need similarities between movies, not users. So we transpose the current dataframe (movies become rows) and then apply cosine_similarity.

In [74]:
data_std_df_T = data_std_df.T
data_std_df_T
Out[74]:
user 1 user 2 user 3 user 4 user 5
action1 0.750000 1.375000 -1.125000 0.125000 -1.125000
action2 1.837117 -0.204124 -0.204124 -1.224745 -0.204124
action3 1.069045 1.069045 -0.267261 -1.603567 -0.267261
romantic1 -0.267261 -1.603567 1.069045 1.069045 -0.267261
romantic2 -0.912871 -0.912871 1.825742 0.000000 0.000000
romantic3 -1.825742 0.000000 0.912871 0.000000 0.912871
In [77]:
from sklearn.metrics.pairwise import cosine_similarity
similar_matrix_numpy = cosine_similarity(data_std_df_T)
similar_matrix_df = pd.DataFrame(similar_matrix_numpy, columns = data.columns, index = data.columns)
similar_matrix_df
Out[77]:
action1 action2 action3 romantic1 romantic2 romantic3
action1 1.000000 0.280671 0.534522 -0.634745 -0.798762 -0.684653
action2 0.280671 1.000000 0.763763 -0.327327 -0.372678 -0.745356
action3 0.534522 0.763763 1.000000 -0.785714 -0.487950 -0.487950
romantic1 -0.634745 -0.327327 -0.785714 1.000000 0.731925 0.243975
romantic2 -0.798762 -0.372678 -0.487950 0.731925 1.000000 0.666667
romantic3 -0.684653 -0.745356 -0.487950 0.243975 0.666667 1.000000

2. Pearson Correlation

Here, you don't need to transpose the matrix: DataFrame.corr already works column-wise. Pearson correlation is equivalent to cosine similarity computed on mean-centered data, and since our columns are standardized (zero mean) already, both approaches give the same result.

In [78]:
corrMatrix = data_std_df.corr(method='pearson')
corrMatrix
Out[78]:
action1 action2 action3 romantic1 romantic2 romantic3
action1 1.000000 0.280671 0.534522 -0.634745 -0.798762 -0.684653
action2 0.280671 1.000000 0.763763 -0.327327 -0.372678 -0.745356
action3 0.534522 0.763763 1.000000 -0.785714 -0.487950 -0.487950
romantic1 -0.634745 -0.327327 -0.785714 1.000000 0.731925 0.243975
romantic2 -0.798762 -0.372678 -0.487950 0.731925 1.000000 0.666667
romantic3 -0.684653 -0.745356 -0.487950 0.243975 0.666667 1.000000
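This equivalence is easy to verify numerically: Pearson correlation is exactly the cosine similarity of mean-centered columns, and standardized columns already have zero mean. A small self-contained check (random data, made-up column names):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.standard_normal((6, 3)), columns=['m1', 'm2', 'm3'])

# Standardize each column: zero mean, unit population variance
Z = (X - X.mean()) / X.std(ddof=0)

# Cosine similarity between the standardized columns, computed by hand
Zn = Z / np.sqrt((Z ** 2).sum())   # scale each column to unit L2 norm
cosine = Zn.T @ Zn

# Pearson correlation of the raw (unstandardized) columns
pearson = X.corr(method='pearson')
```

cosine and pearson agree to floating-point precision, which is why Out[77] and Out[78] above are identical.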

So, we can use either one for our final model! Let us use the corrMatrix for our final prediction.

Keep In Mind

In the correlation matrix, a positive value near 1 means the two movies are strongly related, while a negative value near -1 means they are related in the opposite sense. A common first instinct is to take abs() of the correlation matrix, so that positive and negative relations are treated the same, like the following:

corrMatrix_temp = data_std_df.corr(method='pearson').abs()
corrMatrix_temp

           action1   action2   action3  romantic1  romantic2  romantic3
action1   1.000000  0.280671  0.534522   0.634745   0.798762   0.684653
action2   0.280671  1.000000  0.763763   0.327327   0.372678   0.745356
action3   0.534522  0.763763  1.000000   0.785714   0.487950   0.487950
romantic1 0.634745  0.327327  0.785714   1.000000   0.731925   0.243975
romantic2 0.798762  0.372678  0.487950   0.731925   1.000000   0.666667
romantic3 0.684653  0.745356  0.487950   0.243975   0.666667   1.000000

But this creates a problem. In angular terms, a correlation near -1 means the two movies point in nearly opposite directions: users who love one tend to dislike the other. Taking abs() throws away that direction and treats opposite tastes as if they were identical tastes.

Conclusion

For the above reason, we keep the angular information: negative correlations stay negative, so we do not apply abs() to the correlation matrix.
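A tiny numeric example (with hypothetical correlation values) shows what abs() would break:

```python
import pandas as pd

# Hypothetical correlations of two other movies with a movie the user rated 5
corr = pd.Series({'action3': 0.53, 'romantic2': -0.80})

user_rating, mean_of_rating = 5, 2.5
score_signed = corr * (user_rating - mean_of_rating)     # keeps direction
score_abs = corr.abs() * (user_rating - mean_of_rating)  # direction lost
```

With the sign kept, romantic2 gets a strongly negative score and drops to the bottom of the list; with abs(), it would outrank action3, i.e. a movie disliked by similar users would be recommended most strongly.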

In [90]:
number_of_total_recommended_movies = 6
highest_rating, lowest_rating = 5, 0

mean_of_rating = (highest_rating - lowest_rating) / 2   # 2.5, the neutral rating

def get_similar_movies(movie_name, user_rating):
    # Weight the movie's correlation column by how far the rating is from neutral:
    # ratings above 2.5 boost correlated movies, ratings below 2.5 penalize them
    similar_score = corrMatrix[movie_name] * (user_rating - mean_of_rating)
    similar_score = similar_score.sort_values(ascending=False)
    return similar_score

Here, (user_rating - mean_of_rating) is the key logic. Without it, a rating of 1 or 2 would still push correlated movies up the list, and you couldn't get sensible recommendations for movies the user disliked.

In [91]:
print(get_similar_movies('action1', 5))
action1      2.500000
action3      1.336306
action2      0.701677
romantic1   -1.586864
romantic3   -1.711633
romantic2   -1.996905
Name: action1, dtype: float64
In [92]:
print(get_similar_movies('romantic1', 1))
action3      1.178571
action1      0.952118
action2      0.490990
romantic3   -0.365963
romantic2   -1.097888
romantic1   -1.500000
Name: romantic1, dtype: float64
In [93]:
def if_user_rates_multiple_movies(movies_history):
    # DataFrame.append was removed in pandas 2.0, so build the score rows
    # first and stack them in one go
    rows = [get_similar_movies(movie, rating) for movie, rating in movies_history]
    similar_scores = pd.DataFrame(rows).reset_index(drop=True)
    return similar_scores
In [122]:
user_1 = [['action1', 5], ['romantic1', 2], ['romantic3', 1]]
final_df = if_user_rates_multiple_movies(user_1)
final_df
Out[122]:
action1 action2 action3 romantic1 romantic2 romantic3
0 2.500000 0.701677 1.336306 -1.586864 -1.996905 -1.711633
1 0.317373 0.163663 0.392857 -0.500000 -0.365963 -0.121988
2 1.026980 1.118034 0.731925 -0.365963 -1.000000 -1.500000
In [123]:
final_df.sum().sort_values(ascending=False)
Out[123]:
action1      3.844353
action3      2.461088
action2      1.983374
romantic1   -2.452826
romantic3   -3.333621
romantic2   -3.362868
dtype: float64

Final Step

The final step is to avoid recommending movies the user has already watched!

In [113]:
def get_fresh_recommendation(final_similarity_df, previous_history):
    # Movies the user has already rated
    user_saw = [movie for movie, _ in previous_history]

    # Keep only the rows for movies the user hasn't seen
    # (DataFrame.append was removed in pandas 2.0, so filter instead of appending)
    fresh = [m for m in final_similarity_df.index if m not in user_saw]
    return final_similarity_df.loc[fresh]
In [129]:
all_sum = pd.DataFrame(final_df.sum().sort_values(ascending=False), columns=['similarity'])
all_sum
Out[129]:
similarity
action1 3.844353
action3 2.461088
action2 1.983374
romantic1 -2.452826
romantic3 -3.333621
romantic2 -3.362868
In [128]:
get_fresh_recommendation(all_sum, user_1)
Out[128]:
similarity
action3 2.461088
action2 1.983374
romantic2 -3.362868