import pandas as pd
data = pd.read_csv('toy_dataset.csv', index_col=0)
# index_col=0: without it, pandas adds an extra column named 'Unnamed: 0' containing User 1, ...
data
Overview of data
The dataset holds each user's ratings for the movies action1, action2, action3, romantic2 and romantic3. User 1 hasn't watched the romantic1 movie, so that rating is missing, which is the realistic case!
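The toy_dataset.csv file itself is not shown here. If you want to reproduce the notebook, a stand-in with the structure described above can be built like this (the movie names come from the text, but the particular ratings are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for toy_dataset.csv: rows are users, columns are
# movies, ratings run 0-5, and NaN marks a movie the user hasn't watched.
# These particular numbers are made up for illustration.
toy = pd.DataFrame(
    {
        "action1":   [5, 4, np.nan, 1],
        "action2":   [4, 5, 2, np.nan],
        "action3":   [np.nan, 4, 1, 2],
        "romantic1": [np.nan, 1, 5, 4],
        "romantic2": [1, np.nan, 4, 5],
        "romantic3": [2, 1, 5, np.nan],
    },
    index=["User 1", "User 2", "User 3", "User 4"],
)
toy.to_csv("toy_dataset.csv")  # lets read_csv('toy_dataset.csv', index_col=0) work
```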
Methods
There are two methods in Collaborative Filtering.
Item-to-Item Collaborative Filtering is the more commonly used of the two, and generally the more accurate as well!
Here we simply fill the NaN values in each column with that column's mean. There are many other ways to deal with NaN values; this is just one of them.
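For reference, a few of those other ways of handling NaN values can be sketched like this (df here is a small made-up frame, not the notebook's dataset):

```python
import pandas as pd

# A small made-up frame with missing ratings:
df = pd.DataFrame({"a": [1.0, None, 3.0], "b": [4.0, 5.0, None]})

dropped  = df.dropna()                      # discard incomplete rows entirely
constant = df.fillna(0)                     # treat "not watched" as rating 0
row_mean = df.T.fillna(df.mean(axis=1)).T   # fill with each user's own mean
```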
data.mean()
for i in data.columns:
    # assign back instead of fillna(..., inplace=True) on a column slice,
    # which modern pandas deprecates
    data[i] = data[i].fillna(round(data[i].mean()))
data
A Rough Example for Better Understanding
a=pd.DataFrame([[1,2],[3,4]])
a
a.sum()
The default is axis = 0, which means the elements are summed column-wise!
a.sum(axis=1)
Our Defined Method
def standardized(col):
    # note: dividing by the range (max - min) rather than the standard
    # deviation makes this mean normalization, not textbook standardization
    new_col = (col - col.mean()) / (col.max() - col.min())
    return new_col
standardized(a)
Min Max Normalization
By default, the values are transformed into the range [0, 1].
from sklearn.preprocessing import MinMaxScaler
s_1 = MinMaxScaler()
s_1.fit_transform(a)
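As a sanity check (a sketch, not part of the original notebook): the transform MinMaxScaler applies to each column is (x - min) / (max - min), which we can compute by hand and compare.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

a = np.array([[1.0, 2.0], [3.0, 4.0]])

# Manual min-max normalization, column by column:
manual = (a - a.min(axis=0)) / (a.max(axis=0) - a.min(axis=0))
scaled = MinMaxScaler().fit_transform(a)

assert np.allclose(manual, scaled)  # both map each column onto [0, 1]
```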
Standardization
from sklearn.preprocessing import StandardScaler
s_2 = StandardScaler()
s_2.fit_transform(a)
Final Decision
Our defined method and Standardization both give suitable outcomes. For the final work we will go with Standardization.
scaler = StandardScaler()
data_std_numpy = scaler.fit_transform(data)
data_std_numpy
data_std_df = pd.DataFrame(data_std_numpy, columns = data.columns, index = data.index)
data_std_df
There are two methods to find similarities between two movies
1. Cosine Distance
Since we are doing Item-to-Item collaborative filtering, we need to find similarities between the movies. Thus I have to take the transpose of the current dataframe and apply cosine_similarity to it.
data_std_df_T = data_std_df.T
data_std_df_T
from sklearn.metrics.pairwise import cosine_similarity
similar_matrix_numpy = cosine_similarity(data_std_df_T)
similar_matrix_df = pd.DataFrame(similar_matrix_numpy, columns = data.columns, index = data.columns)
similar_matrix_df
2. Pearson Correlation
Here you don't need to transpose the main matrix. On standardized (mean-centered) data, Pearson correlation is equivalent to cosine similarity, so both approaches give the same result!
corrMatrix = data_std_df.corr(method='pearson')
corrMatrix
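This equivalence can be verified numerically on a small random frame (a sketch, not part of the original notebook): Pearson correlation between columns equals the cosine similarity of the mean-centered columns.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(6, 3)), columns=["x", "y", "z"])

centered = df - df.mean()                    # mean-center each column
cos = cosine_similarity(centered.T)          # cosine between the columns
corr = df.corr(method="pearson").to_numpy()  # Pearson correlation

assert np.allclose(cos, corr)  # identical up to floating-point error
```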
So we can use either one for our final model! Let us use the corrMatrix for our final prediction.
Keep In Mind
In the correlation matrix, a positive value near 1 means the two movies are strongly related, while a negative value near -1 means they are strongly related in the opposite sense. In the usual approach we would take the abs() of the correlation matrix, so that positive and negative relations are treated as equally similar, like the following:
corrMatrix_temp = data_std_df.corr(method='pearson').abs()
corrMatrix_temp
action1 action2 action3 romantic1 romantic2 romantic3
action1 1.000000 0.280671 0.534522 0.634745 0.798762 0.684653
action2 0.280671 1.000000 0.763763 0.327327 0.372678 0.745356
action3 0.534522 0.763763 1.000000 0.785714 0.487950 0.487950
romantic1 0.634745 0.327327 0.785714 1.000000 0.731925 0.243975
romantic2 0.798762 0.372678 0.487950 0.731925 1.000000 0.666667
romantic3 0.684653 0.745356 0.487950 0.243975 0.666667 1.000000
But here that would be a problem! Let me describe it; see the following image. A negative correlation acts like a dissimilarity (in the spirit of Euclidean distance): the two movies are rated in opposite ways, so treating it as a similarity gives a dissimilar output. A positive correlation, on the other hand, behaves like the angular distance between the rating vectors.
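A tiny made-up example of why the sign matters: two movies rated in exactly opposite ways have a correlation near -1, and taking abs() would report them as perfectly similar.

```python
import pandas as pd

# Two movies rated in exactly opposite ways by four users (made-up ratings):
ratings = pd.DataFrame({"action_x":  [5, 4, 1, 0],
                        "romance_y": [0, 1, 4, 5]})

corr = ratings.corr().loc["action_x", "romance_y"]
print(corr)       # close to -1: a strong *negative* relation
print(abs(corr))  # abs() would call them strongly similar, which is misleading
```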
Conclusion
For the above reason we keep the correlations as they are, interpreting them as angular distances in which negative values mean dissimilar. Thus we do not apply abs() to the correlation matrix.
number_of_total_recommended_movies=6
highest_rating, lowest_rating=5, 0
mean_of_rating = (highest_rating - lowest_rating)/2
def get_similar_movies(movie_name, user_rating):
    similar_score = corrMatrix[movie_name] * (user_rating - mean_of_rating)
    similar_score = similar_score.sort_values(ascending=False)
    return similar_score
Here, (user_rating - mean_of_rating) is the key step. Without it, you can't get the right recommendations when a user rates a movie 1 or 2, because low ratings must contribute negative weights!
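A quick sketch of how the weighting behaves, using made-up correlation values (corr_with_action1 here is illustrative, not the notebook's actual corrMatrix):

```python
import pandas as pd

mean_of_rating = 2.5  # midpoint of the 0-5 rating scale, as defined above
# Made-up correlations of a few movies with action1, for illustration only:
corr_with_action1 = pd.Series({"action1": 1.0, "action2": 0.6, "romantic1": -0.4})

# Rating 5 -> weight +2.5: positively correlated movies come out on top.
loved = (corr_with_action1 * (5 - mean_of_rating)).sort_values(ascending=False)

# Rating 1 -> weight -1.5: the sign flips, so the negatively correlated
# romantic1 rises to the top instead.
hated = (corr_with_action1 * (1 - mean_of_rating)).sort_values(ascending=False)
```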
print(get_similar_movies('action1', 5))
print(get_similar_movies('romantic1', 1))
def if_user_rates_multiple_movies(movies_history):
    # DataFrame.append was removed in pandas 2.0, so build the frame from a
    # list of score Series instead (one row per rated movie)
    scores = [get_similar_movies(movie, rating) for movie, rating in movies_history]
    return pd.DataFrame(scores)
user_1 = [['action1', 5], ['romantic1', 2], ['romantic3', 1]]
final_df = if_user_rates_multiple_movies(user_1)
final_df
final_df.sum().sort_values(ascending=False)
Final Step
The final step is to not recommend the movies a user has already watched!
def get_fresh_recommendation(final_similarity_df, previous_history):
    # movies the user has already rated
    user_saw = [movie for movie, rating in previous_history]
    # keep only the rows for movies outside the user's history
    # (DataFrame.append, used originally, was removed in pandas 2.0)
    return final_similarity_df[~final_similarity_df.index.isin(user_saw)]
all_sum = pd.DataFrame(final_df.sum().sort_values(ascending=False), columns=['similarity'])
all_sum
get_fresh_recommendation(all_sum, user_1)