This project uses a text convolutional neural network and the MovieLens dataset to build a movie recommender.
Recommender systems are everywhere in everyday web applications: online shopping, online bookstores, news apps, social networks, music and movie sites, and so on. Wherever there are people, there are recommendations. Based on a user's own preferences and the habits of users with similar tastes, these systems personalize the content each person sees; open a news app, for instance, and because the content is personalized, every user's front page looks different.

This is genuinely useful. In today's information explosion there are countless channels for obtaining information, so people no longer spend most of their time finding information but rather sifting through it for what actually interests them. This is the information-overload problem, and recommender systems arose to solve it.
Collaborative filtering is the most widely used technique in recommender systems. It collects a user's history, personal preferences, and similar information, computes the user's similarity to other users, and uses the ratings of similar users to predict how the target user will rate a particular item. Its strength is that it can recommend items the user has never browsed; its weakness is the cold-start problem: a newly registered user has no interaction history or stated preferences, so the model cannot find similar users or items.
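To make the idea concrete, here is a minimal user-based collaborative-filtering sketch on a toy rating matrix; it is purely illustrative and not part of this project's code:

```python
import numpy as np

# Hypothetical toy user-item rating matrix (rows: users, cols: items); 0 means unrated.
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)

def cosine_sim(a, b):
    """Cosine similarity between two rating vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

# Predict user 0's rating for item 2 as a similarity-weighted average
# of the other users' ratings on that item.
target_user, target_item = 0, 2
others = [u for u in range(len(R)) if u != target_user]
sims = np.array([cosine_sim(R[target_user], R[u]) for u in others])
ratings = np.array([R[u, target_item] for u in others])
print(sims @ ratings / (sims.sum() + 1e-8))
```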
The usual workaround for cold start is to ask newly registered users to first select topics, groups, products, personality traits, favorite music genres, and the like; Douban FM, for example, does exactly this.
Download the dataset

Run the following code to download and extract the dataset.
```python
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
from collections import Counter
import tensorflow as tf

import os
import pickle
import re
import shutil
from tensorflow.python.ops import math_ops
from urllib.request import urlretrieve
from os.path import isfile, isdir
from tqdm import tqdm
import zipfile
import hashlib

def _unzip(save_path, _, database_name, data_path):
    """
    Unzip a downloaded archive
    :param save_path: The path of the zip file
    :param database_name: Name of database
    :param data_path: Path to extract to
    :param _: HACK - Used to have the same interface as _ungzip
    """
    print('Extracting {}...'.format(database_name))
    with zipfile.ZipFile(save_path) as zf:
        zf.extractall(data_path)

def download_extract(database_name, data_path):
    """
    Download and extract the dataset
    :param database_name: Database name
    """
    DATASET_ML1M = 'ml-1m'

    if database_name == DATASET_ML1M:
        url = 'http://files.grouplens.org/datasets/movielens/ml-1m.zip'
        hash_code = 'c4d9eecfca2ab87c1945afe126590906'
        extract_path = os.path.join(data_path, 'ml-1m')
        save_path = os.path.join(data_path, 'ml-1m.zip')
        extract_fn = _unzip

    if os.path.exists(extract_path):
        print('Found {} Data'.format(database_name))
        return

    if not os.path.exists(data_path):
        os.makedirs(data_path)

    if not os.path.exists(save_path):
        with DLProgress(unit='B', unit_scale=True, miniters=1, desc='Downloading {}'.format(database_name)) as pbar:
            urlretrieve(
                url,
                save_path,
                pbar.hook)

    assert hashlib.md5(open(save_path, 'rb').read()).hexdigest() == hash_code, \
        '{} file is corrupted. Remove the file and try again.'.format(save_path)

    os.makedirs(extract_path)
    try:
        extract_fn(save_path, extract_path, database_name, data_path)
    except Exception as err:
        shutil.rmtree(extract_path)  # Remove extraction folder if there is an error
        raise err

    print('Done.')
    # Remove compressed data
    # os.remove(save_path)

class DLProgress(tqdm):
    """
    Report download progress with a tqdm progress bar
    """
    last_block = 0

    def hook(self, block_num=1, block_size=1, total_size=None):
        """
        A hook function that will be called once on establishment of the network connection and
        once after each block read thereafter.
        :param block_num: A count of blocks transferred so far
        :param block_size: Block size in bytes
        :param total_size: The total size of the file. This may be -1 on older FTP servers which do not return
                           a file size in response to a retrieval request.
        """
        self.total = total_size
        self.update((block_num - self.last_block) * block_size)
        self.last_block = block_num

data_dir = './'
download_extract('ml-1m', data_dir)
```

```
Extracting ml-1m...
Done.
```
A first look at the data

This project uses the MovieLens 1M dataset, which contains about 1 million ratings from roughly 6,000 users on nearly 4,000 movies.
The dataset consists of three files:

- User data: users.dat
- Movie data: movies.dat
- Ratings data: ratings.dat
User data

Fields:

- User ID
- Gender
- Age
- Occupation ID
- Zip code

Format of each record: UserID::Gender::Age::Occupation::Zip-code
- Gender is denoted by a "M" for male and "F" for female
Age is chosen from the following ranges:
- 1: "Under 18"
- 18: "18-24"
- 25: "25-34"
- 35: "35-44"
- 45: "45-49"
- 50: "50-55"
- 56: "56+"
Occupation is chosen from the following choices:
- 0: "other" or not specified
- 1: "academic/educator"
- 2: "artist"
- 3: "clerical/admin"
- 4: "college/grad student"
- 5: "customer service"
- 6: "doctor/health care"
- 7: "executive/managerial"
- 8: "farmer"
- 9: "homemaker"
- 10: "K-12 student"
- 11: "lawyer"
- 12: "programmer"
- 13: "retired"
- 14: "sales/marketing"
- 15: "scientist"
- 16: "self-employed"
- 17: "technician/engineer"
- 18: "tradesman/craftsman"
- 19: "unemployed"
- 20: "writer"
```python
users_title = ['UserID', 'Gender', 'Age', 'OccupationID', 'Zip-code']
users = pd.read_table('./ml-1m/users.dat', sep='::', header=None, names=users_title, engine='python')
users.head()
```
| | UserID | Gender | Age | OccupationID | Zip-code |
|---|---|---|---|---|---|
| 0 | 1 | F | 1 | 10 | 48067 |
| 1 | 2 | M | 56 | 16 | 70072 |
| 2 | 3 | M | 25 | 15 | 55117 |
| 3 | 4 | M | 45 | 7 | 02460 |
| 4 | 5 | M | 25 | 20 | 55455 |
As we can see, UserID, Gender, Age, and OccupationID are all categorical fields; the Zip-code field is not used.
Movie data

Fields:

- Movie ID
- Title
- Genres

Format of each record: MovieID::Title::Genres
- Titles are identical to titles provided by the IMDB (including year of release)
- Genres are pipe-separated and are selected from the following genres:
- Action
- Adventure
- Animation
- Children's
- Comedy
- Crime
- Documentary
- Drama
- Fantasy
- Film-Noir
- Horror
- Musical
- Mystery
- Romance
- Sci-Fi
- Thriller
- War
- Western
```python
movies_title = ['MovieID', 'Title', 'Genres']
movies = pd.read_table('./ml-1m/movies.dat', sep='::', header=None, names=movies_title, engine='python')
movies.head()
```
| | MovieID | Title | Genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |
MovieID is a categorical field, Title is text, and Genres is also categorical.
Ratings data

Fields:

- User ID
- Movie ID
- Rating
- Timestamp

Format of each record: UserID::MovieID::Rating::Timestamp
- UserIDs range between 1 and 6040
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is represented in seconds since the epoch as returned by time(2)
- Each user has at least 20 ratings
```python
ratings_title = ['UserID', 'MovieID', 'Rating', 'timestamps']
ratings = pd.read_table('./ml-1m/ratings.dat', sep='::', header=None, names=ratings_title, engine='python')
ratings.head()
```
| | UserID | MovieID | Rating | timestamps |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |
The Rating field is the target we want to learn; the timestamp field is not used.
Data preprocessing

- UserID, Occupation, and MovieID stay as they are.
- Gender: map 'F' and 'M' to 0 and 1.
- Age: map the seven age brackets to the consecutive integers 0 through 6.
- Genres: a categorical field that must be converted to numbers. First build a dictionary mapping each genre string to an integer, then convert each movie's Genres field into a list of integers, since a movie can belong to several genres.
- Title: handled the same way as Genres; first build a word-to-integer dictionary, then convert each title into a list of integers. The year in each title is also stripped out.
- Genres and Title are padded to a fixed length so they are easy to handle in the network; the empty slots are filled with the integer for `<PAD>` (see the short sketch after this list).
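To make this concrete, here is a small sketch using a made-up mini-dictionary; the real dictionaries are built inside load_data() below:

```python
# Hypothetical mini-dictionary for illustration; load_data() below builds the real one.
genres2int = {'<PAD>': 0, 'Animation': 1, "Children's": 2, 'Comedy': 3}
max_len = 5  # illustrative; the real code pads genres to length 18 and titles to 15

def to_padded_ids(genres_str):
    """Map 'A|B|C' to a fixed-length list of ids, padded with the <PAD> id."""
    ids = [genres2int[g] for g in genres_str.split('|')]
    return ids + [genres2int['<PAD>']] * (max_len - len(ids))

print(to_padded_ids("Animation|Children's|Comedy"))  # [1, 2, 3, 0, 0]
```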
Implementing the preprocessing
```python
def load_data():
    """
    Load the dataset from the raw files
    """
    # Read the user data
    users_title = ['UserID', 'Gender', 'Age', 'JobID', 'Zip-code']
    users = pd.read_table('./ml-1m/users.dat', sep='::', header=None, names=users_title, engine='python')
    users = users.filter(regex='UserID|Gender|Age|JobID')
    users_orig = users.values

    # Convert gender and age to integers
    gender_map = {'F': 0, 'M': 1}
    users['Gender'] = users['Gender'].map(gender_map)

    age_map = {val: ii for ii, val in enumerate(set(users['Age']))}
    users['Age'] = users['Age'].map(age_map)

    # Read the movie data
    movies_title = ['MovieID', 'Title', 'Genres']
    movies = pd.read_table('./ml-1m/movies.dat', sep='::', header=None, names=movies_title, engine='python')
    movies_orig = movies.values

    # Strip the year from each title
    pattern = re.compile(r'^(.*)\((\d+)\)$')
    title_map = {val: pattern.match(val).group(1) for val in set(movies['Title'])}
    movies['Title'] = movies['Title'].map(title_map)

    # Genre-to-integer dictionary
    genres_set = set()
    for val in movies['Genres'].str.split('|'):
        genres_set.update(val)

    genres_set.add('<PAD>')
    genres2int = {val: ii for ii, val in enumerate(genres_set)}

    # Convert each Genres field to a fixed-length list of integers (length 18)
    genres_map = {val: [genres2int[row] for row in val.split('|')] for val in set(movies['Genres'])}

    for key in genres_map:
        for cnt in range(max(genres2int.values()) - len(genres_map[key])):
            genres_map[key].insert(len(genres_map[key]) + cnt, genres2int['<PAD>'])

    movies['Genres'] = movies['Genres'].map(genres_map)

    # Title-word-to-integer dictionary
    title_set = set()
    for val in movies['Title'].str.split():
        title_set.update(val)

    title_set.add('<PAD>')
    title2int = {val: ii for ii, val in enumerate(title_set)}

    # Convert each Title to a fixed-length list of integers (length 15)
    title_count = 15
    title_map = {val: [title2int[row] for row in val.split()] for val in set(movies['Title'])}

    for key in title_map:
        for cnt in range(title_count - len(title_map[key])):
            title_map[key].insert(len(title_map[key]) + cnt, title2int['<PAD>'])

    movies['Title'] = movies['Title'].map(title_map)

    # Read the ratings data
    ratings_title = ['UserID', 'MovieID', 'ratings', 'timestamps']
    ratings = pd.read_table('./ml-1m/ratings.dat', sep='::', header=None, names=ratings_title, engine='python')
    ratings = ratings.filter(regex='UserID|MovieID|ratings')

    # Merge the three tables
    data = pd.merge(pd.merge(ratings, users), movies)

    # Split the data into an X table and a y table
    target_fields = ['ratings']
    features_pd, targets_pd = data.drop(target_fields, axis=1), data[target_fields]

    features = features_pd.values
    targets_values = targets_pd.values

    return title_count, title_set, genres2int, features, targets_values, ratings, users, movies, data, movies_orig, users_orig
```
Load the data and save it to disk

load_data() returns:

- title_count: length of the Title field (15)
- title_set: the set of words appearing in titles
- genres2int: dictionary mapping genres to integers
- features: the input X
- targets_values: the learning target y
- ratings: the ratings data as a pandas object
- users: the user data as a pandas object
- movies: the movie data as a pandas object
- data: the three datasets merged into one pandas object
- movies_orig: the raw, unprocessed movie data
- users_orig: the raw, unprocessed user data
```python
# Load the data
title_count, title_set, genres2int, features, targets_values, ratings, users, movies, data, movies_orig, users_orig = load_data()

# Save it to disk
pickle.dump((title_count, title_set, genres2int, features, targets_values, ratings, users, movies, data, movies_orig, users_orig), open('preprocess.p', 'wb'))
```
The preprocessed data

```python
users.head()
```
| | UserID | Gender | Age | JobID |
|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 10 |
| 1 | 2 | 1 | 5 | 16 |
| 2 | 3 | 1 | 6 | 15 |
| 3 | 4 | 1 | 2 | 7 |
| 4 | 5 | 1 | 6 | 20 |
```python
movies.head()
```
| | MovieID | Title | Genres |
|---|---|---|---|
| 0 | 1 | [310, 2184, 634, 634, 634, 634, 634, 634, 634,... | [0, 18, 7, 17, 17, 17, 17, 17, 17, 17, 17, 17,... |
| 1 | 2 | [1182, 634, 634, 634, 634, 634, 634, 634, 634,... | [3, 18, 8, 17, 17, 17, 17, 17, 17, 17, 17, 17,... |
| 2 | 3 | [5011, 4744, 2629, 634, 634, 634, 634, 634, 63... | [7, 9, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17,... |
| 3 | 4 | [4095, 1535, 1886, 634, 634, 634, 634, 634, 63... | [7, 5, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17,... |
| 4 | 5 | [3563, 1725, 3790, 3727, 838, 343, 634, 634, 6... | [7, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17... |
```python
movies.values[0]
```

```
array([1,
       list([310, 2184, 634, 634, 634, 634, 634, 634, 634, 634, 634, 634, 634, 634, 634]),
       list([0, 18, 7, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17])],
      dtype=object)
```
Load the data back from disk

```python
title_count, title_set, genres2int, features, targets_values, ratings, users, movies, data, movies_orig, users_orig = pickle.load(open('preprocess.p', mode='rb'))
```
Model design

Looking at the field types in the dataset, several of them are categorical. The usual treatment is to one-hot encode them, but for fields like UserID and MovieID the encoding becomes extremely sparse and the input dimensionality explodes, which we would rather avoid; my little laptop is no datacenter that shrugs off inputs with hundreds of millions of dimensions.

So during preprocessing we converted these fields to integers, and we use each integer as an index into an embedding matrix. The first layer of the network is therefore an embedding layer, with dimensions (N, 32) and (N, 16).
Genres need one extra step: a movie can have several genres, so looking them up in the embedding matrix yields an (n, 32) matrix, and because there are multiple genres we sum its rows into a single (1, 32) vector.
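A minimal NumPy sketch of that reduction, shapes only; the project's actual TF implementation appears below in get_movie_categories_layers:

```python
import numpy as np

# Suppose a movie has 3 genres; looking them up in a (19, 32) embedding
# matrix yields a (3, 32) matrix, which is summed into one (1, 32) vector.
embed_matrix = np.random.uniform(-1, 1, size=(19, 32))
genre_ids = [0, 18, 7]                      # e.g. Animation|Children's|Comedy
genre_vectors = embed_matrix[genre_ids]     # shape (3, 32)
movie_genre_vector = genre_vectors.sum(axis=0, keepdims=True)
print(movie_genre_vector.shape)             # (1, 32)
```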
Titles get special treatment: instead of a recurrent network, we use a text convolutional network, described below.
After the features are looked up in the embedding layers, each is passed through a fully connected layer, and the outputs are passed through another fully connected layer, finally producing two (1, 200) feature vectors: one for the user and one for the movie.
Our goal is to train these user and movie features and use them to serve recommendations. Once we have the two vectors, any method can be used to fit the rating, since at heart this is a regression problem. I tried two approaches, both optimized with an MSE loss (see the sketch below). The first, shown in the figure above, takes the element-wise product of the two feature vectors and regresses the result against the true rating. The second feeds both vectors into another fully connected layer whose single output is regressed against the true rating.

In practice, after 5 epochs the MSE loss of the second approach sits around 0.8, versus around 1 for the first.
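Here is a minimal sketch of the two prediction heads, assuming TF1-style (?, 200) feature tensors; the names rating_heads, user_feat, and movie_feat are mine, not necessarily those used in the project code:

```python
import tensorflow as tf

def rating_heads(user_feat, movie_feat, targets):
    """user_feat, movie_feat: (?, 200) feature tensors; targets: (?, 1) true ratings."""
    # Head 1: per-example dot product of the two feature vectors.
    inference_dot = tf.reduce_sum(tf.multiply(user_feat, movie_feat),
                                  axis=1, keepdims=True)            # (?, 1)

    # Head 2: concatenate the features and regress through a fully connected layer.
    concat = tf.concat([user_feat, movie_feat], 1)                  # (?, 400)
    inference_fc = tf.layers.dense(concat, 1, activation=None)      # (?, 1)

    # Either head is trained against the true ratings with an MSE loss.
    loss = tf.losses.mean_squared_error(tf.cast(targets, tf.float32), inference_dot)
    return inference_dot, inference_fc, loss
```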
Text convolutional network

The network looks like the figure below, taken from Kim Yoon's paper: Convolutional Neural Networks for Sentence Classification.

For background on applying convolutional networks to text, I recommend reading Understanding Convolutional Neural Networks for NLP.

The first layer of the network is the word embedding layer: the embedding matrix made up of each word's embedding vector. The next layer convolves over this matrix with kernels of several different sizes (window sizes), where the window size is the number of words a convolution covers at a time. This differs from convolution over images: image kernels are usually 2x2, 3x3, or 5x5, while a text kernel must span each word's entire embedding vector, so its size is (number of words, embedding dimension); the window slides over, say, 3, 4, or 5 words at a time. The third layer is max pooling, which yields one long vector, and finally dropout provides regularization, producing the feature vector for the movie Title.
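As an illustration, here is a minimal standalone sketch of the multi-window convolution and max-pooling scheme using the TF1 API; the function name and defaults are my own, and the project's actual implementation starts further below:

```python
import tensorflow as tf

def text_cnn_sketch(embedded_words, window_sizes=(2, 3, 4, 5), filter_num=8,
                    sentence_len=15, embed_dim=32, dropout_keep_prob=0.5):
    """embedded_words: (batch, sentence_len, embed_dim, 1) word-embedding map."""
    pooled = []
    for ws in window_sizes:
        # Each filter spans `ws` words and the full embedding dimension.
        conv = tf.layers.conv2d(embedded_words, filter_num,
                                kernel_size=(ws, embed_dim),
                                activation=tf.nn.relu)    # (batch, len-ws+1, 1, filter_num)
        # Max-pool over all window positions, one value per filter.
        pool = tf.layers.max_pooling2d(conv,
                                       pool_size=(sentence_len - ws + 1, 1),
                                       strides=1)          # (batch, 1, 1, filter_num)
        pooled.append(pool)
    # Concatenate the pooled features from all window sizes into one long vector.
    features = tf.reshape(tf.concat(pooled, 3), [-1, filter_num * len(window_sizes)])
    return tf.nn.dropout(features, dropout_keep_prob)
```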
Helper functions
```python
import tensorflow as tf
import os
import pickle

def save_params(params):
    """
    Save parameters to a file
    """
    pickle.dump(params, open('params.p', 'wb'))


def load_params():
    """
    Load parameters from a file
    """
    return pickle.load(open('params.p', mode='rb'))
```
Implementation
```python
# Embedding dimension
embed_dim = 32
# Number of user IDs
uid_max = max(features.take(0, 1)) + 1  # 6040 + 1 = 6041
# Number of genders
gender_max = max(features.take(2, 1)) + 1  # 1 + 1 = 2
# Number of age brackets
age_max = max(features.take(3, 1)) + 1  # 6 + 1 = 7
# Number of occupations
job_max = max(features.take(4, 1)) + 1  # 20 + 1 = 21

# Number of movie IDs
movie_id_max = max(features.take(1, 1)) + 1  # 3952 + 1 = 3953
# Number of movie genres
movie_categories_max = max(genres2int.values()) + 1  # 18 + 1 = 19
# Number of distinct words in movie titles
movie_title_max = len(title_set)  # 5216

# How to combine a movie's genre embedding vectors: sum them.
# Averaging with "mean" was considered but never implemented.
combiner = "sum"

# Title length
sentences_size = title_count  # = 15
# Text convolution window sizes: slide over 2, 3, 4, and 5 words
window_sizes = {2, 3, 4, 5}
# Number of convolution filters per window size
filter_num = 8

# Dictionary mapping MovieID to row index: IDs in the dataset do not match row
# positions, e.g. the movie in row 5 does not necessarily have MovieID 5
movieid2idx = {val[0]: i for i, val in enumerate(movies.values)}
```
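For example, to fetch the preprocessed feature row for a given MovieID:

```python
# MovieIDs are not contiguous, so map through movieid2idx first
row = movies.values[movieid2idx[1193]]  # features of the movie whose MovieID is 1193
```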
Hyperparameters
```python
# Number of Epochs
num_epochs = 5
# Batch Size
batch_size = 256

dropout_keep = 0.5
# Learning Rate
learning_rate = 0.0001
# Show stats for every n number of batches
show_every_n_batches = 20

save_dir = './save'
```
Inputs

Define placeholders for the inputs.
```python
def get_inputs():
    uid = tf.placeholder(tf.int32, [None, 1], name="uid")
    user_gender = tf.placeholder(tf.int32, [None, 1], name="user_gender")
    user_age = tf.placeholder(tf.int32, [None, 1], name="user_age")
    user_job = tf.placeholder(tf.int32, [None, 1], name="user_job")

    movie_id = tf.placeholder(tf.int32, [None, 1], name="movie_id")
    movie_categories = tf.placeholder(tf.int32, [None, 18], name="movie_categories")
    movie_titles = tf.placeholder(tf.int32, [None, 15], name="movie_titles")
    targets = tf.placeholder(tf.int32, [None, 1], name="targets")
    LearningRate = tf.placeholder(tf.float32, name="LearningRate")
    dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")
    return uid, user_gender, user_age, user_job, movie_id, movie_categories, movie_titles, targets, LearningRate, dropout_keep_prob
```
Building the network

Define the embedding matrices for the user features.
```python
def get_user_embedding(uid, user_gender, user_age, user_job):
    with tf.name_scope("user_embedding"):
        uid_embed_matrix = tf.Variable(tf.random_uniform([uid_max, embed_dim], -1, 1), name="uid_embed_matrix")
        uid_embed_layer = tf.nn.embedding_lookup(uid_embed_matrix, uid, name="uid_embed_layer")

        gender_embed_matrix = tf.Variable(tf.random_uniform([gender_max, embed_dim // 2], -1, 1), name="gender_embed_matrix")
        gender_embed_layer = tf.nn.embedding_lookup(gender_embed_matrix, user_gender, name="gender_embed_layer")

        age_embed_matrix = tf.Variable(tf.random_uniform([age_max, embed_dim // 2], -1, 1), name="age_embed_matrix")
        age_embed_layer = tf.nn.embedding_lookup(age_embed_matrix, user_age, name="age_embed_layer")

        job_embed_matrix = tf.Variable(tf.random_uniform([job_max, embed_dim // 2], -1, 1), name="job_embed_matrix")
        job_embed_layer = tf.nn.embedding_lookup(job_embed_matrix, user_job, name="job_embed_layer")
    return uid_embed_layer, gender_embed_layer, age_embed_layer, job_embed_layer
```
Pass the user embeddings together through fully connected layers to produce the user feature vector.
```python
def get_user_feature_layer(uid_embed_layer, gender_embed_layer, age_embed_layer, job_embed_layer):
    with tf.name_scope("user_fc"):
        # First fully connected layer
        uid_fc_layer = tf.layers.dense(uid_embed_layer, embed_dim, name="uid_fc_layer", activation=tf.nn.relu)
        gender_fc_layer = tf.layers.dense(gender_embed_layer, embed_dim, name="gender_fc_layer", activation=tf.nn.relu)
        age_fc_layer = tf.layers.dense(age_embed_layer, embed_dim, name="age_fc_layer", activation=tf.nn.relu)
        job_fc_layer = tf.layers.dense(job_embed_layer, embed_dim, name="job_fc_layer", activation=tf.nn.relu)

        # Second fully connected layer
        user_combine_layer = tf.concat([uid_fc_layer, gender_fc_layer, age_fc_layer, job_fc_layer], 2)  # (?, 1, 128)
        user_combine_layer = tf.contrib.layers.fully_connected(user_combine_layer, 200, tf.tanh)  # (?, 1, 200)

        user_combine_layer_flat = tf.reshape(user_combine_layer, [-1, 200])
    return user_combine_layer, user_combine_layer_flat
```
Define the Movie ID embedding matrix.
```python
def get_movie_id_embed_layer(movie_id):
    with tf.name_scope("movie_embedding"):
        movie_id_embed_matrix = tf.Variable(tf.random_uniform([movie_id_max, embed_dim], -1, 1), name="movie_id_embed_matrix")
        movie_id_embed_layer = tf.nn.embedding_lookup(movie_id_embed_matrix, movie_id, name="movie_id_embed_layer")
    return movie_id_embed_layer
```
Sum the embedding vectors of a movie's multiple genres.
```python
def get_movie_categories_layers(movie_categories):
    with tf.name_scope("movie_categories_layers"):
        movie_categories_embed_matrix = tf.Variable(tf.random_uniform([movie_categories_max, embed_dim], -1, 1), name="movie_categories_embed_matrix")
        movie_categories_embed_layer = tf.nn.embedding_lookup(movie_categories_embed_matrix, movie_categories, name="movie_categories_embed_layer")
        if combiner == "sum":
            movie_categories_embed_layer = tf.reduce_sum(movie_categories_embed_layer, axis=1, keep_dims=True)
        # elif combiner == "mean":

        return movie_categories_embed_layer
```
Implementing the text convolutional network over the movie Title
```python
def get_movie_cnn_layer(movie_titles):
    # Look up the embedding vector of each word in the title
    with tf.name_scope("movie_embedding"):
        movie_title_embed_matrix = tf.Variable(tf.random_uniform([movie_title_max, embed_dim], -1, 1), name="movie_title_embed_matrix")
```