Content-based Filtering

Quickstart

# transaction list
transaction_list.head()

	userId	movieId	rating	timestamp
0	1	1	4.0	964982703
1	1	3	4.0	964981247
2	1	6	4.0	964982224
3	1	47	5.0	964983815
4	1	50	5.0	964982931

# item ids and features
item_df.head()

	movieId	i_1	i_2	i_3	i_4	i_5	i_6	i_7	i_8	i_9	...	i_291	i_292	i_293	i_294	i_295	i_296	i_297	i_298	i_299	i_300
0	1	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.000000	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
1	3	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.000000	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2	6	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.513025	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
3	47	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.000000	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
4	50	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.000000	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

5 rows × 301 columns

# user ids and features
user_df.head()

	userId	u_1	u_2	u_3	u_4	u_5	u_6	u_7	u_8	u_9	...	u_11	u_12	u_13	u_14	u_15	u_16	u_17	u_18	u_19	u_20
0	1	85	29	42	83	47	26	90	45	55	...	17	68	22	7	40	22	1	0	0	0
1	2	3	0	0	7	0	1	11	10	10	...	1	17	1	1	4	0	0	4	3	0
2	3	11	4	5	9	4	5	14	2	7	...	8	16	5	0	15	1	0	0	0	0
3	4	29	6	10	104	19	58	25	27	38	...	4	120	7	10	12	16	4	1	2	0
4	5	8	6	9	15	7	11	9	12	9	...	1	25	3	2	2	5	0	3	0	0

5 rows × 21 columns

Load ContentBasedModel

from resype.content_based import ContentBasedModel

cb = ContentBasedModel(user_df,
                        item_df,
                        transaction_list,
                        item_id_name='movieId',
                        user_id_name='userId',
                        target_name='rating',
                        timestamp_name='timestamp')
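Judging from the `cb.df_test` and `cb.df_train` outputs on this page, each transaction row ends up carrying the matching user features (`u_*`) and item features (`i_*`). The sketch below illustrates that kind of join with plain dictionaries. `build_rows` is a hypothetical helper written for illustration, not part of resype, and the page does not show how the library performs the join internally. The feature values are a small subset of the tables above.

```python
def build_rows(transactions, user_features, item_features):
    """Attach user and item features to every transaction (hypothetical helper)."""
    rows = []
    for transaction in transactions:
        row = dict(transaction)  # copy so the input is left untouched
        row.update(user_features[transaction["userId"]])
        row.update(item_features[transaction["movieId"]])
        rows.append(row)
    return rows


# Two transactions and a few feature columns taken from the tables above
transactions = [
    {"userId": 1, "movieId": 1, "rating": 4.0, "timestamp": 964982703},
    {"userId": 1, "movieId": 6, "rating": 4.0, "timestamp": 964982224},
]
user_features = {1: {"u_1": 85, "u_2": 29}}
item_features = {
    1: {"i_1": 0.0, "i_9": 0.0},
    6: {"i_1": 0.0, "i_9": 0.513025},
}

rows = build_rows(transactions, user_features, item_features)
print(list(rows[0].keys()))
```

With pandas data frames, the same result is usually obtained with two `DataFrame.merge` calls, one on `userId` and one on `movieId`.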

Train-test Split

# Split the data into train and test sets (70% train)
cb.split_train_test(train_ratio=0.7)

# Alternatively, split chronologically (70% train)
cb.split_train_test_chronological(train_ratio=0.7)

cb.df_test

	userId	movieId	rating	timestamp	u_1	u_2	u_3	u_4	u_5	u_6	...	i_291	i_292	i_293	i_294	i_295	i_296	i_297	i_298	i_299	i_300
5	1	70	3.0	964982400	85	29	42	83	47	26	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
8	1	151	5.0	964984041	85	29	42	83	47	26	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
9	1	157	5.0	964984100	85	29	42	83	47	26	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
16	1	296	3.0	964982967	85	29	42	83	47	26	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
18	1	333	5.0	964981179	85	29	42	83	47	26	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
980	7	8665	3.5	1108602755	54	14	15	49	23	30	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
981	7	8666	1.0	1106779625	54	14	15	49	23	30	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
983	7	8798	4.5	1106636602	54	14	15	49	23	30	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
984	7	8808	1.5	1109746594	54	14	15	49	23	30	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
989	7	8949	4.0	1110757890	54	14	15	49	23	30	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

301 rows × 324 columns

cb.df_train

	userId	movieId	rating	timestamp	u_1	u_2	u_3	u_4	u_5	u_6	...	i_291	i_292	i_293	i_294	i_295	i_296	i_297	i_298	i_299	i_300
206	1	3243	3.0	964981093	85	29	42	83	47	26	...	0.0	0.00000	0.0	0.000000	0.0	0.000000	0.0	0.000000	0.0	0.0
207	1	3247	3.0	964983108	85	29	42	83	47	26	...	0.0	0.60429	0.0	0.000000	0.0	0.000000	0.0	0.000000	0.0	0.0
225	1	3729	5.0	964982363	85	29	42	83	47	26	...	0.0	0.00000	0.0	0.000000	0.0	0.000000	0.0	0.000000	0.0	0.0
201	1	3053	5.0	964984086	85	29	42	83	47	26	...	0.0	0.00000	0.0	0.000000	0.0	0.202585	0.0	0.000000	0.0	0.0
82	1	1256	5.0	964981442	85	29	42	83	47	26	...	0.0	0.00000	0.0	0.000000	0.0	0.000000	0.0	0.000000	0.0	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
972	7	7155	3.0	1173051472	54	14	15	49	23	30	...	0.0	0.00000	0.0	0.590782	0.0	0.000000	0.0	0.000000	0.0	0.0
921	7	2683	2.0	1106635420	54	14	15	49	23	30	...	0.0	0.00000	0.0	0.000000	0.0	0.171763	0.0	0.000000	0.0	0.0
944	7	4700	1.5	1109746591	54	14	15	49	23	30	...	0.0	0.00000	0.0	0.000000	0.0	0.000000	0.0	0.478203	0.0	0.0
897	7	1196	4.0	1106635996	54	14	15	49	23	30	...	0.0	0.00000	0.0	0.000000	0.0	0.000000	0.0	0.000000	0.0	0.0
968	7	6863	4.5	1106712827	54	14	15	49	23	30	...	0.0	0.00000	0.0	0.000000	0.0	0.000000	0.0	0.000000	0.0	0.0

699 rows × 324 columns
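To make the idea of a chronological split concrete, here is a minimal sketch that sorts all transactions by timestamp and puts the earliest share into the training set. It is one common way to do it and an assumption on our part. `split_train_test_chronological` may work differently, for example by splitting per user. `chronological_split` is a hypothetical helper, and the rows are the first five transactions from the Quickstart.

```python
def chronological_split(rows, train_ratio=0.7, timestamp_key="timestamp"):
    """Return (train, test); the earliest train_ratio share of rows goes to train."""
    ordered = sorted(rows, key=lambda row: row[timestamp_key])
    cut = int(len(ordered) * train_ratio)
    return ordered[:cut], ordered[cut:]


# First five rows of the transaction list shown in the Quickstart
transactions = [
    {"userId": 1, "movieId": 1, "rating": 4.0, "timestamp": 964982703},
    {"userId": 1, "movieId": 3, "rating": 4.0, "timestamp": 964981247},
    {"userId": 1, "movieId": 6, "rating": 4.0, "timestamp": 964982224},
    {"userId": 1, "movieId": 47, "rating": 5.0, "timestamp": 964983815},
    {"userId": 1, "movieId": 50, "rating": 5.0, "timestamp": 964982931},
]

train, test = chronological_split(transactions, train_ratio=0.7)
print([row["movieId"] for row in train])  # [3, 6, 1]
print([row["movieId"] for row in test])   # [50, 47]
```

Unlike a random split, this keeps every test transaction later in time than every training transaction, which is closer to how a recommender is used in practice.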

Fit Model

from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(random_state=202109, n_jobs=-1)
cb.fit_ml_cb(model)

RandomForestRegressor(n_jobs=-1, random_state=202109)

# show trained model
cb.model

RandomForestRegressor(n_jobs=-1, random_state=202109)

Predict on Test Set

# Predict ratings for the test set; the result is returned as an array
preds_array = cb.reco_ml_cb_tt()
print(len(preds_array))
preds_array

array([4.07 , 4.18 , 4.26 , 4.15 , 4.24 , 4.45 , 4.26 , 4.53 , 4.69 ,
       3.72 , 4.68 , 4.3  , 4.97 , 4.14 , 3.45 , 4.83 , 4.56 , 4.86 ,
       4.18 , 4.82 , 4.73 , 3.6  , 4.96 , 4.9  , 4.89 , 4.48 , 4.7  ,
       4.59 , 4.74 , 4.2  , 4.76 , 4.9  , 4.17 , 4.39 , 4.61 , 4.27 ,
       4.05 , 4.46 , 4.26 , 4.88 , 4.6  , 4.45 , 4.39 , 4.49 , 3.23 ,
       4.61 , 4.98 , 4.85 , 4.82 , 4.5  , 4.14 , 4.57 , 4.2  , 4.45 ,
       4.52 , 4.32 , 4.7  , 3.74 , 4.45 , 4.87 , 4.6  , 4.29 , 4.51 ,
       4.43 , 4.71 , 4.75 , 4.55 , 4.41 , 4.64 , 4.74 , 3.495, 2.17 ,
       3.86 , 3.25 , 4.225, 3.5  , 4.41 , 3.175, 3.76 , 4.13 , 3.275,
       3.575, 3.585, 3.36 , 3.84 , 3.625, 4.05 , 3.265, 3.5  , 4.175,
       3.795, 3.62 , 3.19 , 3.9  , 4.23 , 3.99 , 3.315, 3.975, 3.66 ,
       3.44 , 4.09 , 2.875, 3.35 , 3.015, 3.88 , 3.92 , 3.23 , 3.955,
       4.1  , 3.5  , 2.895, 3.51 , 3.285, 3.4  , 3.285, 3.555, 4.02 ,
       3.335, 4.17 , 3.435, 4.16 , 3.9  , 4.2  , 3.54 , 4.075, 3.36 ,
       3.65 , 3.965, 3.615, 3.86 , 3.17 , 3.11 , 4.175, 3.02 , 4.08 ,
       3.975, 3.24 , 4.105, 3.745, 3.67 , 4.18 , 4.14 , 4.2  , 3.87 ,
       3.75 , 3.625, 3.63 , 4.08 , 3.835, 3.92 , 3.645, 4.275, 3.69 ,
       3.285, 4.045, 3.015, 3.675, 3.63 , 3.705, 3.06 , 3.335, 3.57 ,
       3.99 , 2.195, 3.14 , 3.285, 3.265, 3.95 , 3.07 , 3.49 , 3.48 ,
       3.5  , 3.45 , 3.285, 3.53 , 3.14 , 3.22 , 3.38 , 3.49 , 3.17 ,
       3.54 , 3.4  , 3.565, 4.   , 3.345, 3.78 , 2.795, 3.46 , 3.71 ,
       3.505, 3.675, 3.9  , 2.99 , 3.2  , 3.345, 3.325, 3.64 , 3.28 ,
       3.615, 3.935, 3.06 , 3.445, 4.085, 3.87 , 3.86 , 2.91 , 3.32 ,
       3.82 , 3.79 , 3.225, 3.285, 3.56 , 2.895, 3.5  , 2.935, 3.31 ,
       2.965, 2.55 , 3.735, 3.91 , 3.08 , 3.63 , 3.525, 3.175, 3.8  ,
       3.64 , 3.645, 3.405, 2.49 , 3.37 , 3.24 , 3.175, 3.45 , 3.615,
       3.585, 3.335, 3.81 , 3.875, 3.69 , 4.145, 3.21 , 3.825, 2.285,
       3.19 , 2.655, 3.645, 3.715, 3.62 , 3.585, 3.5  , 3.355, 4.025,
       3.   , 3.535, 3.645, 3.36 , 3.44 , 3.375, 3.63 , 2.815, 4.195,
       3.56 , 3.51 , 4.15 , 4.305, 4.19 , 3.645, 3.96 , 3.605, 3.475,
       3.04 , 3.96 , 3.975, 3.29 , 2.715, 3.695, 3.36 , 4.28 , 3.95 ,
       3.105, 3.14 , 3.725, 3.725, 3.19 , 3.16 , 3.885, 3.88 , 3.405,
       3.925, 3.73 , 3.77 , 3.715, 1.995, 2.65 , 3.77 , 3.87 , 3.045,
       3.785, 3.775, 3.62 , 2.835])

Get Recommendations

cb.get_rec(user_list=[0, 1, 3], top_n=3)

Predicting utility matrix: 7 out of 7

	user_id	rank_1	rank_2	rank_3
0	0	163981.0	52042.0	117368.0
1	1	122882.0	596.0	7381.0
2	3	2858.0	125.0	4967.0
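The `rank_1` to `rank_3` columns hold item ids ordered by predicted rating. As a rough illustration of how such a ranking can be derived, the sketch below picks the top `n` items from a dictionary of predicted ratings. `top_n_items` is a hypothetical helper, the scores are made up, and skipping items the user has already rated is an assumption rather than documented `get_rec` behaviour.

```python
import heapq


def top_n_items(predicted_ratings, already_rated, n=3):
    """Return the ids of the n highest-scored items the user has not rated yet."""
    candidates = (
        (item_id, score)
        for item_id, score in predicted_ratings.items()
        if item_id not in already_rated
    )
    best = heapq.nlargest(n, candidates, key=lambda pair: pair[1])
    return [item_id for item_id, _ in best]


# Made-up predicted ratings for one user, keyed by movieId
predicted_ratings = {1: 4.2, 3: 3.1, 6: 4.8, 47: 4.9, 50: 4.5}
already_rated = {47}

print(top_n_items(predicted_ratings, already_rated, n=3))  # [6, 50, 1]
```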

Evaluate on Test Set

mse, mae = cb.evaluate_test_set()  # returns MSE and MAE
mse, mae

proceed

(1.4762376245847177, 0.9313455149501663)
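For reference, MSE is the mean of the squared prediction errors and MAE is the mean of their absolute values. The sketch below computes both from scratch. `mse_mae` is a hypothetical helper and the ratings are made up. It shows the standard definitions, not resype's internal code.

```python
def mse_mae(y_true, y_pred):
    """Return (mean squared error, mean absolute error) for two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    errors = [true - pred for true, pred in zip(y_true, y_pred)]
    mse = sum(error * error for error in errors) / len(errors)
    mae = sum(abs(error) for error in errors) / len(errors)
    return mse, mae


# Made-up true and predicted ratings
y_true = [4.0, 3.0, 5.0]
y_pred = [3.5, 3.0, 4.0]

mse_value, mae_value = mse_mae(y_true, y_pred)
print(round(mse_value, 4), round(mae_value, 4))  # 0.4167 0.5
```

Because MSE squares each error, a few large misses raise it much more than they raise MAE. This is consistent with the MSE above (about 1.48) being larger than the MAE (about 0.93).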

Cross Validation

mse, mae = cb.cross_val(model, k=3)

0
Starting splitting
Finished splitting
Starting training
Finished training
Starting computing MAE and MSE
proceed
Finished computing MAE and MSE
1
Starting splitting
Finished splitting
Starting training
Finished training
Starting computing MAE and MSE
proceed
Finished computing MAE and MSE
2
Starting splitting
Finished splitting
Starting training
Finished training
Starting computing MAE and MSE
proceed
Finished computing MAE and MSE

mse, mae

([1.4898759966777408, 1.221593899192318, 1.5457277466777408],
 [0.9358637873754153, 0.8729953725676317, 0.9878654485049833])
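`cross_val` returns one MSE and one MAE per fold. A common next step is to summarise them with a mean and a spread, as sketched below with the standard library. The fold values are copied from the output above. Reporting the mean and sample standard deviation is our suggestion, not something resype does for you.

```python
from statistics import mean, stdev

# Per-fold scores copied from the cross-validation output above
mse_folds = [1.4898759966777408, 1.221593899192318, 1.5457277466777408]
mae_folds = [0.9358637873754153, 0.8729953725676317, 0.9878654485049833]

mse_mean = mean(mse_folds)
mae_mean = mean(mae_folds)

print(f"MSE: {mse_mean:.3f} +/- {stdev(mse_folds):.3f}")
print(f"MAE: {mae_mean:.3f} +/- {stdev(mae_folds):.3f}")
```

The averages come out at roughly 1.42 for MSE and 0.93 for MAE, in line with the single train-test evaluation earlier on this page.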
