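Music Recommendation System in Practice

Author: 我家自动化 | 2024-04-07 15:50:08

Music Recommendation System

Contents

1. Project Background
2. Data Processing
3. Recommender System
4. Summary

1. Project Background

We are going to build a music recommendation system. The dataset is a single file, triplet_dataset.txt (read as train_triplets.txt in the code below), about 3 GB in size:

[Image: the dataset file]

The dataset contains nearly 50 million records, each a tab-separated triple of user, song, and play count:

[Image: sample records]

The data will later be transformed to simplify computation and reduce the memory needed.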

2. Data Processing

2.1 Total plays per user

import pandas as pd

# data_home is the directory that holds the dataset files (defined elsewhere)
output_dict = {}
with open(data_home + 'train_triplets.txt') as f:
    for line in f:
        fields = line.split('\t')
        user = fields[0]
        play_count = int(fields[2])
        # Add this record's play count to the user's running total
        output_dict[user] = output_dict.get(user, 0) + play_count
output_list = [{'user': k, 'play_count': v} for k, v in output_dict.items()]
play_count_df = pd.DataFrame(output_list)
# Sort users by total plays, highest first (low-activity users are filtered out later)
play_count_df = play_count_df.sort_values(by='play_count', ascending=False)

Save the per-user totals to user_playcount_df.csv:

play_count_df.to_csv(path_or_buf='user_playcount_df.csv', index=False)

View the file:

[Image: contents of user_playcount_df.csv]
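As an alternative to the line-by-line loop, the same totals can be computed with pandas in chunks, so the 3 GB file never has to fit in memory at once. This is only a sketch: the helper name total_plays_by, the chunk size, and the in-memory sample rows are illustrative assumptions and not part of the original code.

```python
import io
import pandas as pd

def total_plays_by(source, key, chunksize=1_000_000):
    """Sum play_count per `key` ('user' or 'song'), reading the file in chunks."""
    totals = None
    reader = pd.read_csv(source, sep='\t', header=None,
                         names=['user', 'song', 'play_count'],
                         chunksize=chunksize)
    for chunk in reader:
        part = chunk.groupby(key)['play_count'].sum()
        # Merge this chunk's partial sums into the running totals
        totals = part if totals is None else totals.add(part, fill_value=0)
    return totals.astype(int).sort_values(ascending=False)

# Tiny in-memory stand-in for train_triplets.txt (made-up rows)
sample = io.StringIO("u1\ts1\t2\nu2\ts1\t5\nu1\ts2\t1\nu3\ts2\t4\n")
user_totals = total_plays_by(sample, 'user', chunksize=2)
print(user_totals)
```

Passing 'song' instead of 'user' as the key would give the per-song totals of the next section in the same way.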

2.2 Total plays per song

output_dict = {}
with open(data_home + 'train_triplets.txt') as f:
    for line in f:
        fields = line.split('\t')
        song = fields[1]
        play_count = int(fields[2])
        # Add this record's play count to the song's running total
        output_dict[song] = output_dict.get(song, 0) + play_count
output_list = [{'song': k, 'play_count': v} for k, v in output_dict.items()]
song_count_df = pd.DataFrame(output_list)
# Sort songs by total plays, highest first (rarely played songs are filtered out later)
song_count_df = song_count_df.sort_values(by='play_count', ascending=False)

Save the per-song totals to song_playcount_df.csv:

song_count_df.to_csv(path_or_buf='song_playcount_df.csv', index=False)

View the file:

[Image: contents of song_playcount_df.csv]

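2.3 Selecting the experimental subset

The top 100,000 users account for just over 40% of all plays:

total_play_count = sum(song_count_df.play_count)  # total plays across all songs
# Share of all plays contributed by the top 100,000 users
print((float(play_count_df.head(n=100000).play_count.sum()) / total_play_count) * 100)
play_count_subset = play_count_df.head(n=100000)

Output:

40.8807280500655

The top 30,000 songs account for just over 78% of all plays:

song_count_subset = song_count_df.head(n=30000)
# Share of all plays contributed by the top 30,000 songs
(float(song_count_subset.play_count.sum()) / total_play_count) * 100

Output:

78.39315366645269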
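The two percentages above follow the same pattern, which can be written as one small helper. The sketch below uses made-up counts; the function name top_n_share is an assumption for illustration only.

```python
import pandas as pd

def top_n_share(counts, n):
    """Percentage of the total contributed by the n largest values in `counts`."""
    ordered = counts.sort_values(ascending=False)
    return float(ordered.head(n).sum()) / float(ordered.sum()) * 100

# Made-up play counts for five users
toy_counts = pd.Series([5, 50, 10, 30, 5])
share = top_n_share(toy_counts, 2)
print(share)  # the two largest values (50 and 30) out of a total of 100
```

Sorting inside the helper means the result does not depend on whether the input was already ordered.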

Take these 100,000 users and 30,000 songs:

user_subset = list(play_count_subset.user)
song_subset = list(song_count_subset.song)

Drop the records of all other users and songs:

triplet_dataset = pd.read_csv(filepath_or_buffer=data_home + 'train_triplets.txt', sep='\t',
                             header=None, names=['user', 'song', 'play_count'])
# Keep only the selected users, then only the selected songs
triplet_dataset_sub = triplet_dataset[triplet_dataset.user.isin(user_subset)]
del triplet_dataset  # free memory
triplet_dataset_sub_song = triplet_dataset_sub[triplet_dataset_sub.song.isin(song_subset)]
del triplet_dataset_sub

Save the filtered data to triplet_dataset_sub_song.csv:

triplet_dataset_sub_song.to_csv(path_or_buf=data_home + 'triplet_dataset_sub_song.csv', index=False)

View the file:

[Image: contents of triplet_dataset_sub_song.csv]

Size of the filtered data:

[Image: row count after filtering]

2.4 Adding song metadata

The .db file needs a little processing before it can be saved as CSV:

import sqlite3

conn = sqlite3.connect(data_home + 'track_metadata.db')
cur = conn.cursor()
# List the tables in the SQLite database
cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
cur.fetchall()

Output:

[('songs',)]

track_metadata_df = pd.read_sql(con=conn, sql='select * from songs')
# Keep metadata only for the selected songs
track_metadata_df_sub = track_metadata_df[track_metadata_df.song_id.isin(song_subset)]

track_metadata_df_sub.to_csv(path_or_buf=data_home + 'track_metadata_df_sub.csv', index=False)  # write the CSV file

track_metadata_df_sub.shape

Output:

(30447, 14)

Load what we have so far:

# Both files were written by pandas with its default UTF-8 encoding, so read them back as UTF-8
triplet_dataset_sub_song = pd.read_csv(filepath_or_buffer=data_home + 'triplet_dataset_sub_song.csv', encoding='utf-8')
track_metadata_df_sub = pd.read_csv(filepath_or_buffer=data_home + 'track_metadata_df_sub.csv', encoding='utf-8')

triplet_dataset_sub_song.head()

Output:

[Image: first rows of triplet_dataset_sub_song]

track_metadata_df_sub.head()

Output:

[Image: first rows of track_metadata_df_sub]

We can now map each song to its title through song_id.

Clean the dataset:

# Drop unused columns and duplicate songs
del track_metadata_df_sub['track_id']
del track_metadata_df_sub['artist_mbid']
track_metadata_df_sub = track_metadata_df_sub.drop_duplicates(['song_id'])
# Attach the metadata to every play record
triplet_dataset_sub_song_merged = pd.merge(triplet_dataset_sub_song, track_metadata_df_sub, how='left', left_on='song', right_on='song_id')
triplet_dataset_sub_song_merged.rename(columns={'play_count': 'listen_count'}, inplace=True)

# Drop the columns that are not needed for recommendation
triplet_dataset_sub_song_merged = triplet_dataset_sub_song_merged.drop(columns=[
    'song_id', 'artist_id', 'duration', 'artist_familiarity',
    'artist_hotttnesss', 'track_7digitalid', 'shs_perf', 'shs_work'])

The data is ready:

[Image: first rows of the merged dataset]
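The de-duplicate-then-left-join step can be checked on a toy example before running it on the full data. The frames below are made up for illustration; counting rows with a missing title is an extra sanity check that the original code does not perform.

```python
import pandas as pd

# Made-up play records and metadata (one duplicated song_id, one song without metadata)
plays = pd.DataFrame({'user': ['u1', 'u1', 'u2'],
                      'song': ['S1', 'S2', 'S3'],
                      'play_count': [3, 1, 7]})
meta = pd.DataFrame({'song_id': ['S1', 'S2', 'S2'],
                     'title': ['Song One', 'Song Two', 'Song Two']})

# Without drop_duplicates the join would repeat the S2 play record
meta = meta.drop_duplicates(['song_id'])
merged = pd.merge(plays, meta, how='left', left_on='song', right_on='song_id')

# A left join keeps every play record; songs without metadata get NaN titles
unmatched = int(merged['title'].isna().sum())
print(merged)
print(unmatched)
```

A left join keeps the number of play records unchanged, which is why the duplicates have to be removed from the metadata first.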

2.5 Exploring the music collection

The 20 most-played songs:

import numpy as np
import matplotlib.pyplot as plt
plt.rcdefaults()

# Total play count per song title
popular_songs = triplet_dataset_sub_song_merged[['title','listen_count']].groupby('title').sum().reset_index()
# Sort and keep the top 20
popular_songs_top_20 = popular_songs.sort_values('listen_count', ascending=False).head(n=20)

# Convert to a list for plotting
objects = list(popular_songs_top_20['title'])
# Bar positions
y_pos = np.arange(len(objects))
# Bar heights
performance = list(popular_songs_top_20['listen_count'])
# Draw the chart
plt.bar(y_pos, performance, align='center', alpha=0.5)
plt.xticks(y_pos, objects, rotation='vertical')
plt.ylabel('Item count')
plt.title('Most popular songs')

plt.show()

[Image: bar chart "Most popular songs"]

The most popular releases:

# Total play count per release (album)
popular_release = triplet_dataset_sub_song_merged[['release', 'listen_count']].groupby('release').sum().reset_index()
# Sort and keep the top 20
popular_release_top_20 = popular_release.sort_values('listen_count', ascending=False).head(n=20)

objects = list(popular_release_top_20['release'])
y_pos = np.arange(len(objects))
performance = list(popular_release_top_20['listen_count'])

plt.bar(y_pos, performance, align='center', alpha=0.5)
plt.xticks(y_pos, objects, rotation='vertical')
plt.ylabel('Item count')
plt.title('Most popular Release')

plt.show()

[Image: bar chart "Most popular Release"]

The most popular artists:

# Total play count per artist, then keep the top 20
popular_artist = triplet_dataset_sub_song_merged[['artist_name', 'listen_count']].groupby('artist_name').sum().reset_index()
popular_artist_top_20 = popular_artist.sort_values('listen_count', ascending=False).head(n=20)

objects = list(popular_artist_top_20['artist_name'])
y_pos = np.arange(len(objects))
performance = list(popular_artist_top_20['listen_count'])

plt.bar(y_pos, performance, align='center', alpha=0.5)
plt.xticks(y_pos, objects, rotation='vertical')
plt.ylabel('Item count')
plt.title('Most popular Artist')
plt.show()

[Image: bar chart "Most popular Artist"]

2.6 Distribution of user play counts

# Number of song records per user, sorted in descending order
user_song_count_distribution = triplet_dataset_sub_song_merged[['user','title']].groupby('user').count().reset_index().sort_values(
by='title',ascending = False)
user_song_count_distribution.title.describe()

Output:

count    99996.000000
mean       107.749890
std         79.742561
min          1.000000
25%         53.000000
50%         89.000000
75%        141.000000
max       1189.000000
Name: title, dtype: float64

x = user_song_count_distribution.title
n, bins, patches = plt.hist(x, 50, facecolor='green', alpha=0.75)
plt.xlabel('Play Counts')
plt.ylabel('Num of Users')
plt.title(r'$\mathrm{Histogram\ of\ User\ Play\ Count\ Distribution}\ $')
plt.grid(True)
plt.show()

[Image: histogram of the user play count distribution]

3. Recommender System

3.1 Popularity ranking

New users pose a cold-start problem, and the simplest recommendation for them is a popularity chart. The function below takes the raw data, the name of the user column, and the column to rank by (song title, artist name, or release name), and returns the chart for that column:

from sklearn.model_selection import train_test_split

triplet_dataset_sub_song_merged_set = triplet_dataset_sub_song_merged
train_data, test_data = train_test_split(triplet_dataset_sub_song_merged_set, test_size=0.40, random_state=0)

train_data.head()

Output:

[Image: first rows of train_data]

def create_popularity_recommendation(train_data, user_id, item_id):
    # Count play records per item; item_id may be the song title, release, or artist column
    train_data_grouped = train_data.groupby([item_id]).agg({user_id: 'count'}).reset_index()
    # Expose the count as a score
    train_data_grouped.rename(columns={user_id: 'score'}, inplace=True)
    # Sort by score (descending), breaking ties by item name
    train_data_sort = train_data_grouped.sort_values(['score', item_id], ascending=[False, True])
    # Add a rank column that gives the recommendation priority
    train_data_sort['Rank'] = train_data_sort['score'].rank(ascending=False, method='first')
    # Return the top 20 items
    popularity_recommendations = train_data_sort.head(20)
    return popularity_recommendations

recommendations = create_popularity_recommendation(triplet_dataset_sub_song_merged, 'user', 'title')
recommendations

Output:

[Image: top-20 song chart]

This returns a top-20 song chart. The score here is simply the number of play records per song; a real design could combine more signals, such as the release year or the artist's popularity.
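To see what the score and rank columns contain, the same group, sort, and rank steps can be run on a handful of made-up rows. This is a sketch on toy data, not output from the real dataset.

```python
import pandas as pd

# Made-up play records: song A has three listeners, B two, C one
toy = pd.DataFrame({'user':  ['u1', 'u2', 'u3', 'u1', 'u2', 'u3'],
                    'title': ['A',  'A',  'A',  'B',  'B',  'C']})

scores = (toy.groupby('title')
             .agg(score=('user', 'count'))   # one point per play record
             .reset_index()
             .sort_values(['score', 'title'], ascending=[False, True]))
# method='first' gives tied scores distinct, consecutive ranks
scores['Rank'] = scores['score'].rank(ascending=False, method='first')
print(scores)
```

Because every user gets the same chart, this recommender is a baseline rather than a personalized model.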

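3.2 Song-similarity recommendation (collaborative filtering)

Next we recommend songs by computing similarities. To keep the run time manageable, the experiment uses only part of the data.

song_count_subset = song_count_df.head(n=5000)
user_subset = list(play_count_subset.user)
song_subset = list(song_count_subset.song)
triplet_dataset_sub_song_merged_sub = triplet_dataset_sub_song_merged[triplet_dataset_sub_song_merged.song.isin(song_subset)]

triplet_dataset_sub_song_merged_sub.head()

Output:

[Image: first rows of triplet_dataset_sub_song_merged_sub]

We first import Recommenders, a custom module that contains every function used below. Because the computation takes a fair amount of code and is awkward to show in a notebook, it lives in a separate .py file where all the actual work is done.

Recommenders.py: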

# Thanks to Siraj Raval for this module
# Refer to https://github.com/llSourcell/recommender_live for more details

import numpy as np
import pandas

#Class for Popularity based Recommender System model
class popularity_recommender_py():
    def __init__(self):
        self.train_data = None
        self.user_id = None
        self.item_id = None
        self.popularity_recommendations = None
        
    #Create the popularity based recommender system model
    def create(self, train_data, user_id, item_id):
        self.train_data = train_data
        self.user_id = user_id
        self.item_id = item_id

        #Get a count of user_ids for each unique song as recommendation score
        train_data_grouped = train_data.groupby([self.item_id]).agg({self.user_id: 'count'}).reset_index()
        train_data_grouped.rename(columns = {user_id: 'score'},inplace=True)
    
        #Sort the songs based upon recommendation score
        train_data_sort = train_data_grouped.sort_values(['score', self.item_id], ascending = [0,1])
    
        #Generate a recommendation rank based upon score
        train_data_sort['Rank'] = train_data_sort['score'].rank(ascending=0, method='first')
        
        #Get the top 10 recommendations
        self.popularity_recommendations = train_data_sort.head(10)

    #Use the popularity based recommender system model to
    #make recommendations
    def recommend(self, user_id):    
        #Work on a copy so the stored ranking is not modified
        user_recommendations = self.popularity_recommendations.copy()
        
        #Add user_id column for which the recommendations are being generated
        user_recommendations['user_id'] = user_id
    
        #Bring user_id column to the front
        cols = user_recommendations.columns.tolist()
        cols = cols[-1:] + cols[:-1]
        user_recommendations = user_recommendations[cols]
        
        return user_recommendations
    

#Class for Item similarity based Recommender System model
class item_similarity_recommender_py():
    def __init__(self):
        self.train_data = None
        self.user_id = None
        self.item_id = None
        self.cooccurence_matrix = None
        self.songs_dict = None
        self.rev_songs_dict = None
        self.item_similarity_recommendations = None
        
    #Get unique items (songs) corresponding to a given user
    def get_user_items(self, user):
        user_data = self.train_data[self.train_data[self.user_id] == user]
        user_items = list(user_data[self.item_id].unique())
        
        return user_items
        
    #Get unique users for a given item (song)
    def get_item_users(self, item):
        item_data = self.train_data[self.train_data[self.item_id] == item]
        item_users = set(item_data[self.user_id].unique())
            
        return item_users
        
    #Get unique items (songs) in the training data
    def get_all_items_train_data(self):
        all_items = list(self.train_data[self.item_id].unique())
            
        return all_items
        
    #Construct cooccurence matrix
    def construct_cooccurence_matrix(self, user_songs, all_songs):
            
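        ####################################
        #Get users for all songs in user_songs.
        #Goal: decide what to recommend to the selected test user.
        #Steps:
        # 1. Collect every song the selected user has listened to
        # 2. For each of those songs, find all the users who listened to it
        # 3. For every song in the data set, compute its Jaccard index against
        #    each of the user's songs, from the intersection and union of listeners
        ####################################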
        user_songs_users = []        
        for i in range(0, len(user_songs)):
            user_songs_users.append(self.get_item_users(user_songs[i]))
            
        ###############################################
        #Initialize the item cooccurence matrix of size 
        #len(user_songs) X len(songs)
        ###############################################
        cooccurence_matrix = np.matrix(np.zeros(shape=(len(user_songs), len(all_songs))), float)
           
        #############################################################
        #Calculate similarity between user songs and all unique songs
        #in the training data
        #############################################################
        for i in range(0,len(all_songs)):
            #Calculate unique listeners (users) of song (item) i
            songs_i_data = self.train_data[self.train_data[self.item_id] == all_songs[i]]
            users_i = set(songs_i_data[self.user_id].unique())
            
            for j in range(0,len(user_songs)):       
                    
                #Get unique listeners (users) of song (item) j
                users_j = user_songs_users[j]
                    
                #Calculate intersection of listeners of songs i and j
                users_intersection = users_i.intersection(users_j)
                
                #Calculate cooccurence_matrix[j,i] as Jaccard Index
                if len(users_intersection) != 0:
                    #Calculate union of listeners of songs i and j
                    users_union = users_i.union(users_j)
                    
                    cooccurence_matrix[j,i] = float(len(users_intersection))/float(len(users_union))
                else:
                    cooccurence_matrix[j,i] = 0
                    
        
        return cooccurence_matrix

    
    #Use the cooccurence matrix to make top recommendations
    def generate_top_recommendations(self, user, cooccurence_matrix, all_songs, user_songs):
        print("Non zero values in cooccurence_matrix :%d" % np.count_nonzero(cooccurence_matrix))
        
        #Calculate a weighted average of the scores in cooccurence matrix for all user songs.
        user_sim_scores = cooccurence_matrix.sum(axis=0)/float(cooccurence_matrix.shape[0])
        user_sim_scores = np.array(user_sim_scores)[0].tolist()
 
        #Sort the indices of user_sim_scores based upon their value
        #Also maintain the corresponding score
        sort_index = sorted(((e,i) for i,e in enumerate(list(user_sim_scores))), reverse=True)
    
        #Create a dataframe from the following
        columns = ['user_id', 'song', 'score', 'rank']
        #index = np.arange(1) # array of numbers for the number of samples
        df = pandas.DataFrame(columns=columns)
         
        #Fill the dataframe with top 10 item based recommendations
        rank = 1 
        for i in range(0,len(sort_index)):
            if ~np.isnan(sort_index[i][0]) and all_songs[sort_index[i][1]] not in user_songs and rank <= 10:
                df.loc[len(df)]=[user,all_songs[sort_index[i][1]],sort_index[i][0],rank]
                rank = rank+1
        
        #Handle the case where there are no recommendations
        if df.shape[0] == 0:
            print("The current user has no songs for training the item similarity based recommendation model.")
            return -1
        else:
            return df
 
    #Create the item similarity based recommender system model
    def create(self, train_data, user_id, item_id):
        self.train_data = train_data
        self.user_id = user_id
        self.item_id = item_id

    #Use the item similarity based recommender system model to
    #make recommendations
    def recommend(self, user):
        
        ########################################
        #A. Get all unique songs for this user
        ########################################
        user_songs = self.get_user_items(user)    
            
        print("No. of unique songs for the user: %d" % len(user_songs))
        
        ######################################################
        #B. Get all unique items (songs) in the training data
        ######################################################
        all_songs = self.get_all_items_train_data()
        
        print("no. of unique songs in the training set: %d" % len(all_songs))
         
        ###############################################
        #C. Construct item cooccurence matrix of size 
        #len(user_songs) X len(songs)
        ###############################################
        cooccurence_matrix = self.construct_cooccurence_matrix(user_songs, all_songs)
        
        #######################################################
        #D. Use the cooccurence matrix to make recommendations
        #######################################################
        df_recommendations = self.generate_top_recommendations(user, cooccurence_matrix, all_songs, user_songs)
                
        return df_recommendations
    
    #Get similar items to given items
    def get_similar_items(self, item_list):
        
        user_songs = item_list
        
        ######################################################
        #B. Get all unique items (songs) in the training data
        ######################################################
        all_songs = self.get_all_items_train_data()
        
        print("no. of unique songs in the training set: %d" % len(all_songs))
         
        ###############################################
        #C. Construct item cooccurence matrix of size 
        #len(user_songs) X len(songs)
        ###############################################
        cooccurence_matrix = self.construct_cooccurence_matrix(user_songs, all_songs)
        
        #######################################################
        #D. Use the cooccurence matrix to make recommendations
        #######################################################
        user = ""
        df_recommendations = self.generate_top_recommendations(user, cooccurence_matrix, all_songs, user_songs)
         
        return df_recommendations

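That is a lot of code, so here is what it does overall; when trying it yourself it is still best to step through it in a debugger, as suggested earlier. To recommend for a given user, we first collect the songs that user has listened to, then compare those songs with every song in the data set and recommend the ones most similar to what the user already knows. How is the similarity computed?

Suppose the current user has listened to 66 songs and the data set contains 4879 songs. We build a [66, 4879] matrix in which each entry is the similarity between one song the user has heard and one song in the data set, measured with the Jaccard index. For entry [i, j], take the set of users who listened to the user's i-th song (say 3000 people) and the set of users who listened to song j of the data set (say 5000 people). The Jaccard index is then:

J(i, j) = |U_i ∩ U_j| / |U_i ∪ U_j|, where U_i and U_j are those two sets of listeners.

In other words, if two songs are similar their audiences should largely coincide, so the intersection-to-union ratio is high; if the songs are unrelated, the ratio is small.
The code above fills in every entry of the [66, 4879] matrix. One more step is needed before recommending: each candidate song j has a Jaccard value against each of the user's 66 songs, so those 66 values are summed and averaged, and the mean is the recommendation score of song j.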
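The scoring rule described above (Jaccard index per song pair, then the mean over the user's songs) can be sketched with plain Python sets. The song names and listener sets below are made up, and the helper jaccard is an illustrative stand-in for the matrix code in Recommenders.py, not a copy of it.

```python
def jaccard(a, b):
    """Jaccard index |a ∩ b| / |a ∪ b| of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical listeners of each song (toy data)
listeners = {
    'song_x': {'u1', 'u2', 'u3'},
    'song_y': {'u2', 'u3', 'u4'},
    'song_z': {'u5'},
    'song_w': {'u1', 'u2', 'u3', 'u4'},
}
user_songs = ['song_x', 'song_z']   # songs the target user has already heard
candidates = [s for s in listeners if s not in user_songs]

# Score of a candidate = mean Jaccard similarity to every song the user has heard
scores = {
    c: sum(jaccard(listeners[c], listeners[s]) for s in user_songs) / len(user_songs)
    for c in candidates
}
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

Here song_w shares three of four listeners with song_x, so it outranks song_y even though neither has anything in common with song_z.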

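# Run the recommendation
# (is_model must be an item_similarity_recommender_py instance already created on the training data; user_id is the target user)
is_model.recommend(user_id)

[Image: recommendation output for the user]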

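3.3 Recommendation by matrix factorization (SVD)

The similarity approach is simple to implement, but on large data it is far too slow: every user requires several passes over the whole data set. Matrix factorization is the more common approach today.

Singular Value Decomposition (SVD) is a classic matrix factorization method, and we use it for the next recommender. Its starting point resembles the latent factor model discussed earlier: a large matrix is expressed as a product of smaller ones. The basic form is shown below:

[Image: basic form of the SVD]

Applying SVD to a matrix A yields U, S, and V:

[Image: U, S, and V obtained from the decomposition]

Multiplying U, S, and V back together gives A2. Comparing A2 with A shows some difference, but not much, so A2 can stand in for A as an approximation:

[Image: comparison of A2 and A]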
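The decompose-and-rebuild step can be tried on a small dense matrix with NumPy. The matrix values and the choice k = 2 are made up for illustration; the pictures in the original may use different numbers.

```python
import numpy as np

# Made-up 4x4 rating-like matrix
A = np.array([[5.0, 5.0, 0.0, 1.0],
              [4.0, 5.0, 0.0, 0.0],
              [0.0, 1.0, 5.0, 4.0],
              [1.0, 0.0, 4.0, 5.0]])

# s holds the singular values in descending order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Using all singular values reproduces A up to rounding error
full = U @ np.diag(s) @ Vt

# Keeping only the k largest singular values gives the approximation A2
k = 2
A2 = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(A2, 2))
print(np.linalg.norm(A - A2))   # size of the approximation error
```

The error of A2 is determined by the singular values that were dropped, which is why discarding only the small ones changes the matrix little.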

SVD needs users' ratings of items, but our data set only records plays, not explicit ratings, so we have to define a rating ourselves. A user who likes a song should play it often; a user who does not should play it rarely.
We define a user's rating of a song as: the user's play count for that song / the user's total play count. The code is:

# Total play count per user
triplet_dataset_sub_song_merged_sum_df = triplet_dataset_sub_song_merged[['user','listen_count']].groupby('user').sum().reset_index()
triplet_dataset_sub_song_merged_sum_df.rename(columns={'listen_count':'total_listen_count'},inplace=True)
# Attach each user's total to every one of that user's records
triplet_dataset_sub_song_merged = pd.merge(triplet_dataset_sub_song_merged,triplet_dataset_sub_song_merged_sum_df)
triplet_dataset_sub_song_merged.head()

[Image: merged data with the total_listen_count column]

# Rating = share of the user's plays that went to this song
triplet_dataset_sub_song_merged['fractional_play_count'] = triplet_dataset_sub_song_merged['listen_count']/triplet_dataset_sub_song_merged['total_listen_count']

triplet_dataset_sub_song_merged[triplet_dataset_sub_song_merged.user =='d6589314c0a9bcbca4fee0c93b14bc402363afea'][['user','song','listen_count','fractional_play_count']].head()

[Image: fractional_play_count values for one user]

from scipy.sparse import coo_matrix

small_set = triplet_dataset_sub_song_merged
# Build a compact integer index for every distinct user and song
user_codes = small_set.user.drop_duplicates().reset_index()
song_codes = small_set.song.drop_duplicates().reset_index()
user_codes.rename(columns={'index':'user_index'}, inplace=True)
song_codes.rename(columns={'index':'song_index'}, inplace=True)
song_codes['so_index_value'] = list(song_codes.index)
user_codes['us_index_value'] = list(user_codes.index)
small_set = pd.merge(small_set,song_codes,how='left')
small_set = pd.merge(small_set,user_codes,how='left')
# Rows are users, columns are songs, values are the ratings
mat_candidate = small_set[['us_index_value','so_index_value','fractional_play_count']]
data_array = mat_candidate.fractional_play_count.values
row_array = mat_candidate.us_index_value.values
col_array = mat_candidate.so_index_value.values

data_sparse = coo_matrix((data_array, (row_array, col_array)),dtype=float)

data_sparse

Output:

<99996x30000 sparse matrix of type '<class 'numpy.float64'>'
	with 10774558 stored elements in COOrdinate format>

The earlier code groups by user to get each user's total play count, then divides the user's play count for each song by that total; the resulting column, fractional_play_count, is the user's rating of each song.
With the ratings in hand we can build the matrix. One detail needs handling first: user IDs and song IDs in the raw data are long strings, which are awkward to work with, so both are re-indexed with integers.

user_codes[user_codes.user =='2a2f776cbac6df64d6cb505e7e834e01684673b6']

[Image: index entry for this user]
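The rating definition and the re-indexing can be seen end to end on a few made-up rows. This sketch uses pd.factorize as a shorter stand-in for the drop_duplicates and merge steps above; that substitution and the toy values are assumptions for illustration.

```python
import pandas as pd
from scipy.sparse import coo_matrix

# Made-up play records
toy = pd.DataFrame({'user': ['ua', 'ua', 'ub', 'uc'],
                    'song': ['S1', 'S2', 'S1', 'S3'],
                    'listen_count': [3, 1, 5, 2]})

# Rating = plays of this song by this user / all plays by this user
toy['fractional_play_count'] = (
    toy['listen_count'] / toy.groupby('user')['listen_count'].transform('sum'))

# pd.factorize assigns 0, 1, 2, ... in order of first appearance
user_idx, user_labels = pd.factorize(toy['user'])
song_idx, song_labels = pd.factorize(toy['song'])

# Sparse user-by-song rating matrix
ratings = coo_matrix((toy['fractional_play_count'].values, (user_idx, song_idx)),
                     shape=(len(user_labels), len(song_labels)), dtype=float)
print(ratings.toarray())
```

user_labels and song_labels map the integer positions back to the original IDs when the recommendations need to be displayed.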

Factorizing the matrix with SVD:

With the matrix built, we run the SVD. scipy already provides an SVD routine for sparse matrices, so we use it here.

import math as mt
import numpy as np
from scipy.sparse.linalg import svds
from scipy.sparse import csc_matrix

def compute_svd(urm, K):
    # Truncated SVD that keeps the K largest singular values
    U, s, Vt = svds(urm, K)

    dim = (len(s), len(s))
    S = np.zeros(dim, dtype=np.float32)
    for i in range(0, len(s)):
        # Note: the diagonal holds the square roots of the singular values
        S[i,i] = mt.sqrt(s[i])

    U = csc_matrix(U, dtype=np.float32)
    S = csc_matrix(S, dtype=np.float32)
    Vt = csc_matrix(Vt, dtype=np.float32)
    
    return U, S, Vt

def compute_estimated_matrix(urm, U, S, Vt, uTest, K, test):
    rightTerm = S*Vt 
    max_recommendation = 250
    estimatedRatings = np.zeros(shape=(MAX_UID, MAX_PID), dtype=np.float16)
    # Integer dtype: float16 cannot represent song indices above 2048 exactly
    recomendRatings = np.zeros(shape=(MAX_UID, max_recommendation), dtype=np.int32)
    for userTest in uTest:
        # Estimated ratings of this user for every song
        prod = U[userTest, :]*rightTerm
        estimatedRatings[userTest, :] = prod.todense()
        # Indices of the songs with the highest estimated ratings
        recomendRatings[userTest, :] = (-estimatedRatings[userTest, :]).argsort()[:max_recommendation]
    return recomendRatings

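Running SVD requires an extra parameter K: the number of leading singular values kept for the approximation, that is, the size of the S matrix. A larger K is slower to compute but closer to the true result, so the value is a trade-off we have to judge ourselves.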
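The effect of K can be checked on a small random sparse matrix by measuring how far the rank-K reconstruction is from the original. The matrix size, density, and K values below are arbitrary assumptions chosen to keep the sketch fast.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Made-up 60x40 matrix in which roughly 20% of the entries are non-zero
rng = np.random.default_rng(0)
dense = rng.random((60, 40)) * (rng.random((60, 40)) < 0.2)
M = csr_matrix(dense)

errors = {}
for k in (2, 10, 30):
    U, s, Vt = svds(M, k=k)            # k must be smaller than min(M.shape)
    approx = U @ np.diag(s) @ Vt       # rank-k reconstruction
    errors[k] = np.linalg.norm(dense - approx)
print(errors)
```

The error shrinks as K grows, while the cost of the factorization and of storing U and Vt increases.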

K=50
urm = data_sparse
MAX_PID = urm.shape[1]
MAX_UID = urm.shape[0]

U, S, Vt = compute_svd(urm, K)

Here K is 50; MAX_PID is the number of songs we selected earlier and MAX_UID the number of users.

Next we choose the users to test. Any users will do; the numbers are user indices. For each of them we estimate how much they would like each of the 30,000 candidate songs, that is, the rating they would give each song. The SVD has already produced the small factor matrices, so we only need to multiply them back together:

uTest = [4,5,6,7,8,873,23]

uTest_recommended_items = compute_estimated_matrix(urm, U, S, Vt, uTest, K, True)

for user in uTest:
    print("Recommendation for user with user id {}".format(user))
    rank_value = 1
    # Top 10 recommended song indices for this user
    for i in uTest_recommended_items[user,0:10]:
        song_details = small_set[small_set.so_index_value == i].drop_duplicates('so_index_value')[['title','artist_name']]
        print("The number {} recommended song is {} BY {}".format(rank_value, list(song_details['title'])[0],list(song_details['artist_name'])[0]))
        rank_value+=1

Output:

Recommendation for user with user id 4
The number 1 recommended song is Fireflies BY Charttraxx Karaoke
The number 2 recommended song is Hey_ Soul Sister BY Train
The number 3 recommended song is OMG BY Usher featuring will.i.am
The number 4 recommended song is Lucky (Album Version) BY Jason Mraz & Colbie Caillat
The number 5 recommended song is Vanilla Twilight BY Owl City
The number 6 recommended song is Crumpshit BY Philippe Rochard
The number 7 recommended song is Billionaire [feat. Bruno Mars]  (Explicit Album Version) BY Travie McCoy
The number 8 recommended song is Love Story BY Taylor Swift
The number 9 recommended song is TULENLIEKKI BY M.A. Numminen
The number 10 recommended song is Use Somebody BY Kings Of Leon
Recommendation for user with user id 5
The number 1 recommended song is Sehr kosmisch BY Harmonia
The number 2 recommended song is Ain't Misbehavin BY Sam Cooke
The number 3 recommended song is Dog Days Are Over (Radio Edit) BY Florence + The Machine
The number 4 recommended song is Revelry BY Kings Of Leon
The number 5 recommended song is Undo BY Björk
The number 6 recommended song is Cosmic Love BY Florence + The Machine
The number 7 recommended song is Home BY Edward Sharpe & The Magnetic Zeros
The number 8 recommended song is You've Got The Love BY Florence + The Machine
The number 9 recommended song is Bring Me To Life BY Evanescence
The number 10 recommended song is Tighten Up BY The Black Keys
Recommendation for user with user id 6
The number 1 recommended song is Crumpshit BY Philippe Rochard
The number 2 recommended song is Marry Me BY Train
The number 3 recommended song is Hey_ Soul Sister BY Train
The number 4 recommended song is Lucky (Album Version) BY Jason Mraz & Colbie Caillat
The number 5 recommended song is One On One BY the bird and the bee
The number 6 recommended song is I Never Told You BY Colbie Caillat
The number 7 recommended song is Canada BY Five Iron Frenzy
The number 8 recommended song is Fireflies BY Charttraxx Karaoke
The number 9 recommended song is TULENLIEKKI BY M.A. Numminen
The number 10 recommended song is Bring Me To Life BY Evanescence
Recommendation for user with user id 7
The number 1 recommended song is Behind The Sea [Live In Chicago] BY Panic At The Disco
The number 2 recommended song is The City Is At War (Album Version) BY Cobra Starship
The number 3 recommended song is Dead Souls BY Nine Inch Nails
The number 4 recommended song is Una Confusion BY LU
The number 5 recommended song is Home BY Edward Sharpe & The Magnetic Zeros
The number 6 recommended song is Climbing Up The Walls BY Radiohead
The number 7 recommended song is Tighten Up BY The Black Keys
The number 8 recommended song is Tive Sim BY Cartola
The number 9 recommended song is West One (Shine On Me) BY The Ruts
The number 10 recommended song is Cosmic Love BY Florence + The Machine
Recommendation for user with user id 8
The number 1 recommended song is Undo BY Björk
The number 2 recommended song is Canada BY Five Iron Frenzy
The number 3 recommended song is Better To Reign In Hell BY Cradle Of Filth
The number 4 recommended song is Unite (2009 Digital Remaster) BY Beastie Boys
The number 5 recommended song is Behind The Sea [Live In Chicago] BY Panic At The Disco
The number 6 recommended song is Rockin' Around The Christmas Tree BY Brenda Lee
The number 7 recommended song is Devil's Slide BY Joe Satriani
The number 8 recommended song is Revelry BY Kings Of Leon
The number 9 recommended song is 16 Candles BY The Crests
The number 10 recommended song is Catch You Baby (Steve Pitron & Max Sanna Radio Edit) BY Lonnie Gordon
Recommendation for user with user id 873
The number 1 recommended song is The Scientist BY Coldplay
The number 2 recommended song is Yellow BY Coldplay
The number 3 recommended song is Clocks BY Coldplay
The number 4 recommended song is Fix You BY Coldplay
The number 5 recommended song is In My Place BY Coldplay
The number 6 recommended song is Shiver BY Coldplay
The number 7 recommended song is Speed Of Sound BY Coldplay
The number 8 recommended song is Creep (Explicit) BY Radiohead
The number 9 recommended song is Sparks BY Coldplay
The number 10 recommended song is Use Somebody BY Kings Of Leon
Recommendation for user with user id 23
The number 1 recommended song is Garden Of Eden BY Guns N' Roses
The number 2 recommended song is Don't Speak BY John Dahlbäck
The number 3 recommended song is Master Of Puppets BY Metallica
The number 4 recommended song is TULENLIEKKI BY M.A. Numminen
The number 5 recommended song is Bring Me To Life BY Evanescence
The number 6 recommended song is Kryptonite BY 3 Doors Down
The number 7 recommended song is Make Her Say BY Kid Cudi / Kanye West / Common
The number 8 recommended song is Night Village BY Deep Forest
The number 9 recommended song is Better To Reign In Hell BY Cradle Of Filth
The number 10 recommended song is Xanadu BY Olivia Newton-John;Electric Light Orchestra

Each user now has a list of recommendations, sorted by estimated score.
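The ranking step itself is a sort of one user's estimated ratings. The sketch below uses made-up ratings and also illustrates a detail worth knowing when storing the resulting song indices: a float16 array rounds large integers, so an integer dtype is the safer container.

```python
import numpy as np

# Hypothetical estimated ratings of one user for five songs
estimated = np.array([0.10, 0.70, 0.05, 0.40, 0.55])

# Negate so that argsort (ascending) lists the highest ratings first
top3 = (-estimated).argsort()[:3]
print(top3)

# float16 has an 11-bit significand, so integers above 2048 are rounded
rounded = int(np.float16(29999))
print(rounded)
```

With around 30,000 songs, an index that passes through float16 can point at a neighbouring song instead of the intended one.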

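4. Summary

This article used a music data set for a personalized recommendation task. After preprocessing and merging the data, we completed the task with two methods. The similarity method picks, from the candidate set, the songs most similar to those a user has listened to; its drawback is cost, since the computation must be redone for every user. The SVD method builds a rating matrix, factorizes it, then for each target user reconstructs the estimated ratings of all songs and returns them in sorted order.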
