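GroupLens Lab provides several collections of movie rating data gathered from MovieLens users in the late 1990s and early 2000s. The data include movie ratings, genres, and release years, along with demographic data about the viewers (age, zip code, gender, and occupation). The MovieLens 1M dataset contains about 1 million ratings of about 4,000 movies from about 6,000 users. It is spread across three tables: ratings, user information, and movie information.
The data are stored as .dat files. Each table can be loaded into its own pandas DataFrame with pandas.read_table: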
- import pandas as pd
-
- unames = ["user_id", "gender", "age", "occupation", "zip"]
- users = pd.read_table("datasets/movielens/users.dat", sep="::",
- header=None, names=unames, engine="python")
-
- rnames = ["user_id", "movie_id", "rating", "timestamp"]
- ratings = pd.read_table("datasets/movielens/ratings.dat", sep="::",
- header=None, names=rnames, engine="python")
-
- mnames = ["movie_id", "title", "genres"]
- movies = pd.read_table("datasets/movielens/movies.dat", sep="::",
- header=None, names=mnames, engine="python")
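The loading step can be tried without the data files. The following is a minimal sketch that parses a small in-memory sample in the same "::"-delimited layout; the two sample rows and the name `users_sample` are made up for illustration and are not taken from the dataset.

```python
import io

import pandas as pd

# Hypothetical sample in the same "::"-delimited layout as users.dat
# (values are invented for illustration).
sample = io.StringIO("1::F::1::10::48067\n2::M::56::16::70072\n")

unames = ["user_id", "gender", "age", "occupation", "zip"]

# A separator longer than one character is interpreted as a regular
# expression, which is why engine="python" is passed explicitly.
users_sample = pd.read_table(sample, sep="::", header=None,
                             names=unames, engine="python")
print(users_sample)
```

Note that zip is parsed as an integer here. If zip codes with leading zeros matter for your analysis, passing `dtype={"zip": str}` should keep them as text.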
To check that everything loaded correctly, look at the first five rows of each table. The code and its output:
- users.head(5)
- ratings.head(5)
- movies.head(5)
- ratings
| | user_id | movie_id | rating | timestamp |
---|---|---|---|---|
0 | 1 | 1193 | 5 | 978300760 |
1 | 1 | 661 | 3 | 978302109 |
2 | 1 | 914 | 3 | 978301968 |
3 | 1 | 3408 | 4 | 978300275 |
4 | 1 | 2355 | 5 | 978824291 |
... | ... | ... | ... | ... |
1000204 | 6040 | 1091 | 1 | 956716541 |
1000205 | 6040 | 1094 | 5 | 956704887 |
1000206 | 6040 | 562 | 5 | 956704746 |
1000207 | 6040 | 1096 | 4 | 956715648 |
1000208 | 6040 | 1097 | 4 | 956715569 |
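The analysis becomes much simpler once all the data are merged into a single table. We first use pandas's merge function to combine ratings with users, then merge movies into the result. pandas infers which columns to use as merge (join) keys from the overlapping column names (here user_id, then movie_id):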
- data = pd.merge(pd.merge(ratings, users), movies)
- data
- data.iloc[0]
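To make the key inference concrete, here is a small sketch on toy tables (all names ending in `_toy` and their values are invented for illustration). It compares the implicit merge with one that names the keys explicitly through the `on` parameter.

```python
import pandas as pd

# Toy stand-ins for the three tables (values are made up).
ratings_toy = pd.DataFrame({"user_id": [1, 1, 2],
                            "movie_id": [10, 20, 10],
                            "rating": [5, 3, 4]})
users_toy = pd.DataFrame({"user_id": [1, 2], "gender": ["F", "M"]})
movies_toy = pd.DataFrame({"movie_id": [10, 20],
                           "title": ["Movie A", "Movie B"]})

# Implicit keys: pandas joins on the column names the two frames share.
implicit = pd.merge(pd.merge(ratings_toy, users_toy), movies_toy)

# Explicit keys: the same joins with the key columns spelled out.
explicit = (ratings_toy.merge(users_toy, on="user_id")
                       .merge(movies_toy, on="movie_id"))

print(implicit.equals(explicit))
```

Spelling out `on` is slightly more verbose, but it guards against an accidental join on a column that merely happens to share a name.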
With a pivot table (the pivot_table method), we can compute each movie's mean rating broken down by gender:
- mean_ratings = data.pivot_table("rating", index="title",
- columns="gender", aggfunc="mean")
As before, look at the first five rows to check the result:
- mean_ratings.head(5)
gender | F | M |
---|---|---|
title | ||
$1,000,000 Duck (1971) | 3.375000 | 2.761905 |
'Night Mother (1986) | 3.388889 | 3.352941 |
'Til There Was You (1997) | 2.675676 | 2.733333 |
'burbs, The (1989) | 2.793478 | 2.962085 |
...And Justice for All (1979) | 3.828571 | 3.689024 |
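
One way to read the pivot table is as a group-by on two keys followed by an unstack. The sketch below checks this on toy data (the frame `data_toy` and its values are invented for illustration).

```python
import pandas as pd

# Toy ratings (values are made up).
data_toy = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie A", "Movie B", "Movie B"],
    "gender": ["F", "M", "M", "F", "M"],
    "rating": [5, 3, 4, 2, 4],
})

# Mean rating per title, with one column per gender.
via_pivot = data_toy.pivot_table("rating", index="title",
                                 columns="gender", aggfunc="mean")

# The same numbers: group by both keys, then move gender into the columns.
via_groupby = (data_toy.groupby(["title", "gender"])["rating"]
               .mean()
               .unstack("gender"))

print(via_pivot)
print(via_groupby)
```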
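Next, filter out movies that received fewer than 250 ratings (the threshold is arbitrary; choose whatever suits you).
To do this, group the data by title and call size() to get a Series holding the number of ratings for each title.
The titles in active_titles are those with at least 250 ratings. We can use them as index labels to select the matching rows from mean_ratings.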
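- # Group by title and count the ratings for each title
- ratings_by_title = data.groupby("title").size()
- ratings_by_title.head()
- # Keep the titles with at least 250 ratings
- active_titles = ratings_by_title.index[ratings_by_title >= 250]
- active_titles
- # Select those titles from mean_ratings
- mean_ratings = mean_ratings.loc[active_titles]
- mean_ratings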
gender | F | M |
---|---|---|
title | ||
'burbs, The (1989) | 2.793478 | 2.962085 |
10 Things I Hate About You (1999) | 3.646552 | 3.311966 |
101 Dalmatians (1961) | 3.791444 | 3.500000 |
101 Dalmatians (1996) | 3.240000 | 2.911215 |
12 Angry Men (1957) | 4.184397 | 4.328421 |
... | ... | ... |
Young Guns (1988) | 3.371795 | 3.425620 |
Young Guns II (1990) | 2.934783 | 2.904025 |
Young Sherlock Holmes (1985) | 3.514706 | 3.363344 |
Zero Effect (1998) | 3.864407 | 3.723140 |
eXistenZ (1999) | 3.098592 | 3.289086 |
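
The count-then-filter pattern can be sketched on toy data as follows. The frame `data_toy` and the threshold `min_ratings = 2` are assumptions chosen so the toy example has something to filter; the text above uses 250.

```python
import pandas as pd

# Toy ratings: Movie A has 3 ratings, Movie B has 1, Movie C has 2.
data_toy = pd.DataFrame({
    "title": ["Movie A"] * 3 + ["Movie B"] + ["Movie C"] * 2,
    "rating": [5, 4, 3, 2, 4, 5],
})

min_ratings = 2  # hypothetical threshold for the toy data

# Number of ratings per title.
counts = data_toy.groupby("title").size()

# Boolean mask on the counts, applied to the index to get the kept titles.
active = counts.index[counts >= min_ratings]

# Use the kept titles as labels to select rows from another result.
mean_toy = data_toy.groupby("title")["rating"].mean()
filtered = mean_toy.loc[active]
print(filtered)
```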
To see the movies that female viewers rated most highly, sort by the F column in descending order:
- top_female_ratings = mean_ratings.sort_values("F", ascending=False)
- top_female_ratings.head()
gender | F | M |
---|---|---|
title | ||
Close Shave, A (1995) | 4.644444 | 4.473795 |
Wrong Trousers, The (1993) | 4.588235 | 4.478261 |
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) | 4.572650 | 4.464589 |
Wallace & Gromit: The Best of Aardman Animation (1996) | 4.563107 | 4.385075 |
Schindler's List (1993) | 4.562602 | 4.491415 |
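Suppose we want to find the movies that divide male and female viewers the most. One way is to add a column to mean_ratings holding the difference in mean ratings, then sort by it. Because diff is computed as M minus F, an ascending sort lists first the movies that women rated higher than men did: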
- mean_ratings["diff"] = mean_ratings["M"] - mean_ratings["F"]
- sorted_by_diff = mean_ratings.sort_values("diff")
- sorted_by_diff.head()
gender | F | M | diff |
---|---|---|---|
title | |||
Dirty Dancing (1987) | 3.790378 | 2.959596 | -0.830782 |
Jumpin' Jack Flash (1986) | 3.254717 | 2.578358 | -0.676359 |
Grease (1978) | 3.975265 | 3.367041 | -0.608224 |
Little Women (1994) | 3.870588 | 3.321739 | -0.548849 |
Steel Magnolias (1989) | 3.901734 | 3.365957 | -0.535777 |
Reversing the row order and taking the first 5 rows gives the movies that men preferred and women rated lower:
- sorted_by_diff[::-1].head()
gender | F | M | diff |
---|---|---|---|
title | |||
Good, The Bad and The Ugly, The (1966) | 3.494949 | 4.221300 | 0.726351 |
Kentucky Fried Movie, The (1977) | 2.878788 | 3.555147 | 0.676359 |
Dumb & Dumber (1994) | 2.697987 | 3.336595 | 0.638608 |
Longest Day, The (1962) | 3.411765 | 4.031447 | 0.619682 |
Cable Guy, The (1996) | 2.250000 | 2.863787 | 0.613787 |
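
If the direction of the gap does not matter, one option is to rank by the absolute value of diff instead. This is a sketch on toy numbers (the frame `mean_toy` is invented), not a step from the original walkthrough.

```python
import pandas as pd

# Toy mean ratings by gender (values are made up).
mean_toy = pd.DataFrame(
    {"F": [3.5, 4.0, 3.0], "M": [3.6, 3.0, 3.8]},
    index=["Movie A", "Movie B", "Movie C"],
)
mean_toy["diff"] = mean_toy["M"] - mean_toy["F"]

# Order the titles by the size of the gap, largest first,
# ignoring which gender gave the higher rating.
order = mean_toy["diff"].abs().sort_values(ascending=False).index
by_gap = mean_toy.loc[order]
print(by_gap)
```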
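Finally, find the movies that provoked the most disagreement among all viewers, regardless of gender. Here disagreement is measured by the standard deviation of the ratings: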
- rating_std_by_title = data.groupby("title")["rating"].std()
- rating_std_by_title = rating_std_by_title.loc[active_titles]
- rating_std_by_title.head()
- rating_std_by_title.sort_values(ascending=False)[:10]
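As a quick sanity check of what std() measures, the sketch below uses toy ratings (invented values) and compares the pandas result with Python's statistics.stdev. Both compute the sample standard deviation, so a movie with polarized ratings scores higher than one with near-uniform ratings.

```python
import statistics

import pandas as pd

# Toy ratings: Movie A is polarizing, Movie B is not (values are made up).
data_toy = pd.DataFrame({
    "title": ["Movie A"] * 4 + ["Movie B"] * 4,
    "rating": [1, 5, 1, 5, 3, 3, 4, 3],
})

# Sample standard deviation of the ratings for each title.
std_by_title = data_toy.groupby("title")["rating"].std()

# Cross-check against the standard library for one title.
expected_a = statistics.stdev([1, 5, 1, 5])

print(std_by_title.sort_values(ascending=False))
print(expected_a)
```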
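The genres column stores each movie's genres as a single pipe-separated (|) string. Split it into a list and store the result in a new genre column, which makes the later statistics easier: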
- movies["genres"].head()
- movies["genres"].head().str.split("|")
- movies["genre"] = movies.pop("genres").str.split("|")
- movies.head()
| | movie_id | title | genre |
---|---|---|---|
0 | 1 | Toy Story (1995) | [Animation, Children's, Comedy] |
1 | 2 | Jumanji (1995) | [Adventure, Children's, Fantasy] |
2 | 3 | Grumpier Old Men (1995) | [Comedy, Romance] |
3 | 4 | Waiting to Exhale (1995) | [Comedy, Drama] |
4 | 5 | Father of the Bride Part II (1995) | [Comedy] |
Use explode to break up the genre lists so that each row holds a single genre:
- movies_exploded = movies.explode("genre")
- movies_exploded[:10]
| | movie_id | title | genre |
---|---|---|---|
0 | 1 | Toy Story (1995) | Animation |
0 | 1 | Toy Story (1995) | Children's |
0 | 1 | Toy Story (1995) | Comedy |
1 | 2 | Jumanji (1995) | Adventure |
1 | 2 | Jumanji (1995) | Children's |
1 | 2 | Jumanji (1995) | Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy |
2 | 3 | Grumpier Old Men (1995) | Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy |
3 | 4 | Waiting to Exhale (1995) | Drama |
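
The split-then-explode steps can be sketched together on a toy table (the frame `movies_toy` and its values are invented for illustration). Note that explode repeats the original index label for every row it produces from the same movie.

```python
import pandas as pd

# Toy movie table with pipe-separated genres (values are made up).
movies_toy = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["Movie A", "Movie B"],
    "genres": ["Comedy|Drama", "Action"],
})

# Replace the string column with a column of lists.
movies_toy["genre"] = movies_toy.pop("genres").str.split("|")

# One row per (movie, genre) pair; the index labels are repeated.
exploded = movies_toy.explode("genre")
print(exploded)
```

If a fresh 0..n-1 index is preferred, calling `reset_index(drop=True)` on the result is one way to get it.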
Compute the mean rating of each genre for each age group:
- ratings_with_genre = pd.merge(pd.merge(movies_exploded, ratings), users)
- ratings_with_genre.iloc[0]
- genre_ratings = (ratings_with_genre.groupby(["genre", "age"])
- ["rating"].mean()
- .unstack("age"))
- genre_ratings[:10]
age | 1 | 18 | 25 | 35 | 45 | 50 | 56 |
---|---|---|---|---|---|---|---|
genre | |||||||
Action | 3.506385 | 3.447097 | 3.453358 | 3.538107 | 3.528543 | 3.611333 | 3.610709 |
Adventure | 3.449975 | 3.408525 | 3.443163 | 3.515291 | 3.528963 | 3.628163 | 3.649064 |
Animation | 3.476113 | 3.624014 | 3.701228 | 3.740545 | 3.734856 | 3.780020 | 3.756233 |
Children's | 3.241642 | 3.294257 | 3.426873 | 3.518423 | 3.527593 | 3.556555 | 3.621822 |
Comedy | 3.497491 | 3.460417 | 3.490385 | 3.561984 | 3.591789 | 3.646868 | 3.650949 |
Crime | 3.710170 | 3.668054 | 3.680321 | 3.733736 | 3.750661 | 3.810688 | 3.832549 |
Documentary | 3.730769 | 3.865865 | 3.946690 | 3.953747 | 3.966521 | 3.908108 | 3.961538 |
Drama | 3.794735 | 3.721930 | 3.726428 | 3.782512 | 3.784356 | 3.878415 | 3.933465 |
Fantasy | 3.317647 | 3.353778 | 3.452484 | 3.482301 | 3.532468 | 3.581570 | 3.532700 |
Film-Noir | 4.145455 | 3.997368 | 4.058725 | 4.064910 | 4.105376 | 4.175401 | 4.125932 |
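
The column labels in this table are age-group codes rather than literal ages. The sketch below repeats the group-by and unstack on toy data and then renames the columns with readable labels. The mapping in `age_labels` follows the MovieLens 1M README as I recall it, so treat it as an assumption and check it against your copy; the toy frame is invented.

```python
import pandas as pd

# Age-group codes, assumed from the MovieLens 1M README (verify locally).
age_labels = {1: "Under 18", 18: "18-24", 25: "25-34", 35: "35-44",
              45: "45-49", 50: "50-55", 56: "56+"}

# Toy ratings already joined with genre and age code (values are made up).
toy = pd.DataFrame({
    "genre": ["Comedy", "Comedy", "Comedy", "Drama", "Drama"],
    "age": [1, 1, 25, 25, 56],
    "rating": [4, 2, 5, 3, 4],
})

# Mean rating per (genre, age code), with age codes moved into the columns.
genre_by_age = toy.groupby(["genre", "age"])["rating"].mean().unstack("age")

# Swap the numeric codes for readable labels.
labeled = genre_by_age.rename(columns=age_labels)
print(labeled)
```

Combinations with no ratings show up as NaN after the unstack, which is worth keeping in mind when a genre is rare in some age group.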