Aggregation

def util_mat_agg(df_u, df_i, u_agg='sum', i_agg='sum'):
    """
    Aggregates the results of the clustering with respect to item clusters and user clusters.
    
    Parameters
    ----------
    df_u    : pandas DataFrame
              Dataframe of the user_ids with cluster assignments and
              item ratings of users from the original matrix
    
    df_i    : pandas DataFrame
              Dataframe of the item_ids with cluster assignments and 
              item ratings of users from the original matrix
    
    u_agg   : str
              Aggregration method to be used for users through 'sum' or 'mean' 
    
    i_agg   : str
              Aggregration method to be used for items through 'sum' or 'mean' 
              
    Returns
    -------
    util_mat   : pandas DataFrame
                 utility matrix consisting of the aggregrated user 
                 clusters as rows and aggregated item clusters as columns
    """
    import numpy as np
    import pandas as pd
    
    u_series = df_u['u_cluster']
    i_series = df_i['i_cluster']

    u_ids = np.unique(u_series.values)
    i_ids = np.unique(i_series.values) 

    u_feats = {}
    for u_id in u_ids: #u_ids are clusters of u_id
        sub_df = df_u.groupby('u_cluster').get_group(
            u_id).drop(columns=['u_cluster']).T
        sub_df = sub_df.merge(i_series, left_index=True, right_index=True)
        
        if u_agg == 'sum':
            df_grp = sub_df.groupby('i_cluster').sum()
        if u_agg == 'mean':
            df_grp = sub_df.groupby('i_cluster').mean()
        if not isinstance(u_agg,str):
            df_grp = sub_df.groupby('i_cluster').apply(u_agg)
            
        if i_agg == 'sum':
            df_grp = df_grp.sum(axis=1)
        if i_agg == 'mean':
            df_grp = df_grp.mean(axis=1)
        if not isinstance(i_agg,str):
            df_grp = df_grp.apply(i_agg, axis=1)
        
        u_feats[u_id] = df_grp
    

    u_matrix = pd.DataFrame()
    for k, v in u_feats.items():
        u_matrix = u_matrix.merge(v.rename(k), how='outer',
                                  left_index=True, right_index=True)

    util_mat = u_matrix.fillna(0).T
    util_mat.index.rename('u_cluster', inplace=True)
    return util_mat
util_mat_agg(df_u, df_i, 'mean', 'mean')
i_cluster 0 1 2 3 4
u_cluster
0 0.155453 0.531532 0.944444 0.727240 0.894180
1 0.001109 0.010294 0.054167 0.016229 0.187710
2 0.006609 0.095030 0.302757 0.054039 0.259259
3 0.034548 0.415233 0.727517 0.235731 0.658730
4 0.027294 0.056212 0.516762 0.298726 0.580766

Aggregation by sum(or mean):

  1. Users are identified to belong to their respective user clusters. Membership is limited to one user cluster per user.

  2. Within that user cluster, the items rated are identified and labeled with their respective item clusters.

  3. The first step of aggregation is done within each cluster where users are grouped according to the clusters of the items.

  4. The second step is to get the “sum” (“mean”) of each item clusters.

  5. Steps 3 and 4 are repeated for all clusters.

The following diagrams show sample clusters of users, where columns are the users in the cluster and rows are their ratings of the items. For each cluster, steps 3 and 4 are done using “sum” as the method of aggregation. image image image

  1. The result of step 5 for all clusters are aggregated into a final utility matrix. image

Unit Test

import unittest
import pandas as pd
from pandas._testing import assert_frame_equal
import numpy as np

class Test_util_matrix_agg(unittest.TestCase):
    
    def test_util_mat_agg(self):
        arru = np.array([[1,1,1,0,1,0,1],[0,0,0,1,1,0,1],[1,1,1,1,1,1,1],[0,1,1,0,0,1,1],[0,1,1,1,1,1,0],
                 [1,1,1,0,0,0,1],[1,1,0,0,0,1,1]])
        df_ut = pd.DataFrame(arru, index=[1,2,3,4,5,6,7])
        df_ut['u_cluster'] = [0,0,1,2,0,2,1]
        
        arri = np.transpose(arru)
        df_it = pd.DataFrame(arri)
        df_it['i_cluster'] = [0,3,1,2,0,2,1]
        
        agg_arr = np.array([[4,4,3,2],[3,3,3,2],[1,4,1,2]])
        agg_arr = np.array([[4,4,3,2],[3,3,3,2],[1,4,1,2]])
        df_a = pd.DataFrame(agg_arr, index=[0,1,2])
        df_a.index.name='u_cluster'
        df_a.columns=[0,1,2,3]
        
        umt= util_mat_agg(df_ut, df_it, "sum", "sum")
        assert_frame_equal(df_a, umt, check_dtype=True, check_names=False)
    
unittest.main(argv=[''], verbosity=2, exit=False)
test_util_mat_agg (__main__.Test_util_matrix_agg) ... ok

----------------------------------------------------------------------
Ran 1 test in 0.027s

OK
<unittest.main.TestProgram at 0x7f3aac133f40>