
Python for Data Analysis: The MovieLens 1M Dataset


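Contents

MovieLens 1M Dataset

1.1 Data Preprocessing

1.2 Computing Average Movie Ratings

1.3 Filtering the Data

1.4 Sorting the Data

1.5 Measuring Rating Disagreement

1.6 Counting Movie Genres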

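MovieLens 1M Dataset

GroupLens Research provides several collections of movie rating data gathered from MovieLens users in the late 1990s and early 2000s. The data include movie ratings, genres, release years, and demographic information about the viewers (age, zip code, gender, and occupation). The MovieLens 1M dataset contains about 1 million ratings from roughly 6,000 users on roughly 4,000 movies. It is spread across three tables: ratings, user information, and movie information.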

1.1 Data Preprocessing

The data are stored as .dat files. Each table can be loaded into its own pandas DataFrame with pandas.read_table:

    import pandas as pd

    # The .dat files have no header row, so column names are supplied here
    unames = ["user_id", "gender", "age", "occupation", "zip"]
    users = pd.read_table("datasets/movielens/users.dat", sep="::",
                          header=None, names=unames, engine="python")

    rnames = ["user_id", "movie_id", "rating", "timestamp"]
    ratings = pd.read_table("datasets/movielens/ratings.dat", sep="::",
                            header=None, names=rnames, engine="python")

    mnames = ["movie_id", "title", "genres"]
    movies = pd.read_table("datasets/movielens/movies.dat", sep="::",
                           header=None, names=mnames, engine="python")
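The `::` separator is two characters long, so pandas interprets it as a regular expression, which its default C parser does not support. That is why `engine="python"` is passed in the calls above. A minimal sketch of the same call on an in-memory sample, using three rows reconstructed from the ratings output shown in this article (the names `sample` and `sample_ratings` are introduced here for illustration):

```python
import io

import pandas as pd

# Three rows in the same "::"-separated layout as ratings.dat,
# reconstructed from the ratings output shown in this article.
sample = io.StringIO(
    "1::1193::5::978300760\n"
    "1::661::3::978302109\n"
    "1::914::3::978301968\n"
)

rnames = ["user_id", "movie_id", "rating", "timestamp"]

# A separator longer than one character is treated as a regular
# expression, which only the Python parsing engine can handle.
sample_ratings = pd.read_table(sample, sep="::", header=None,
                               names=rnames, engine="python")
print(sample_ratings)
```

The Python engine is slower than the C engine, which is usually acceptable for a file of this size.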

Inspect the first five rows of each table to verify that the data loaded correctly. The code and the output for ratings:

    users.head(5)
    ratings.head(5)
    movies.head(5)
    ratings

             user_id  movie_id  rating  timestamp
    0              1      1193       5  978300760
    1              1       661       3  978302109
    2              1       914       3  978301968
    3              1      3408       4  978300275
    4              1      2355       5  978824291
    ...          ...       ...     ...        ...
    1000204     6040      1091       1  956716541
    1000205     6040      1094       5  956704887
    1000206     6040       562       5  956704746
    1000207     6040      1096       4  956715648
    1000208     6040      1097       4  956715569

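Analyzing data spread across three tables is much simpler once everything is merged into a single table. Use the pandas merge function to merge ratings with users first, then merge the result with movies. pandas infers which columns to use as the merge (or join) keys from the overlapping column names: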

    # Join ratings with users on user_id, then with movies on movie_id
    data = pd.merge(pd.merge(ratings, users), movies)
    data
    data.iloc[0]
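To illustrate how the join keys are inferred, here is a small sketch with toy frames (all names prefixed with `toy_` and all values are invented for illustration). When `on` is omitted, `pd.merge` joins on the columns common to both frames, so spelling the keys out should produce the same table while making the intent explicit:

```python
import pandas as pd

# Toy stand-ins for the three tables; all values are invented.
toy_ratings = pd.DataFrame({"user_id": [1, 1, 2],
                            "movie_id": [10, 20, 10],
                            "rating": [5, 3, 4]})
toy_users = pd.DataFrame({"user_id": [1, 2], "gender": ["F", "M"]})
toy_movies = pd.DataFrame({"movie_id": [10, 20],
                           "title": ["Movie A", "Movie B"]})

# Implicit keys: pandas joins on the columns shared by both frames.
implicit = pd.merge(pd.merge(toy_ratings, toy_users), toy_movies)

# Explicit keys: the same joins with the key columns spelled out.
explicit = (toy_ratings
            .merge(toy_users, on="user_id")
            .merge(toy_movies, on="movie_id"))

print(implicit)
print(implicit.equals(explicit))
```

Naming the keys explicitly guards against an accidental join on a column that merely happens to share a name in both tables.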

1.2 Computing Average Movie Ratings

The pivot_table method can compute the mean rating of each movie grouped by gender:

    mean_ratings = data.pivot_table("rating", index="title",
                                    columns="gender", aggfunc="mean")

Again, view the first five rows to check the result:

    mean_ratings.head(5)

    gender                                F         M
    title
    $1,000,000 Duck (1971)         3.375000  2.761905
    'Night Mother (1986)           3.388889  3.352941
    'Til There Was You (1997)      2.675676  2.733333
    'burbs, The (1989)             2.793478  2.962085
    ...And Justice for All (1979)  3.828571  3.689024
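A pivot table of this kind is a grouped aggregation reshaped so that one grouping key becomes the columns. The following sketch, on invented toy data, builds the same table once with `pivot_table` and once with a two-key `groupby` followed by `unstack` (the names `toy`, `by_pivot`, and `by_groupby` are introduced here for illustration):

```python
import pandas as pd

# Toy merged data; all values are invented.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie A", "Movie B", "Movie B"],
    "gender": ["F", "M", "M", "F", "M"],
    "rating": [5, 3, 4, 2, 4],
})

# One row per title, one column per gender, mean rating in each cell.
by_pivot = toy.pivot_table("rating", index="title",
                           columns="gender", aggfunc="mean")

# The same table from a two-key groupby followed by unstack.
by_groupby = (toy.groupby(["title", "gender"])["rating"]
              .mean()
              .unstack("gender"))

print(by_pivot)
print(by_groupby)
```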

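1.3 Filtering the Data

Next, filter out the movies that received fewer than 250 ratings (the threshold is arbitrary and can be adjusted).

To do this, group the data by title and call size() to get a Series holding the number of ratings for each title.

Every title in active_titles has at least 250 ratings. Use this index to select the corresponding rows from mean_ratings.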

    # Count the ratings for each title
    ratings_by_title = data.groupby("title").size()
    ratings_by_title.head()

    # Keep the titles with at least 250 ratings
    active_titles = ratings_by_title.index[ratings_by_title >= 250]
    active_titles

    # Select the rows for those titles
    mean_ratings = mean_ratings.loc[active_titles]
    mean_ratings

    gender                                    F         M
    title
    'burbs, The (1989)                 2.793478  2.962085
    10 Things I Hate About You (1999)  3.646552  3.311966
    101 Dalmatians (1961)              3.791444  3.500000
    101 Dalmatians (1996)              3.240000  2.911215
    12 Angry Men (1957)                4.184397  4.328421
    ...                                     ...       ...
    Young Guns (1988)                  3.371795  3.425620
    Young Guns II (1990)               2.934783  2.904025
    Young Sherlock Holmes (1985)       3.514706  3.363344
    Zero Effect (1998)                 3.864407  3.723140
    eXistenZ (1999)                    3.098592  3.289086
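The filtering step relies on boolean indexing: comparing the size Series with a threshold yields a boolean mask, and indexing the Series' own index with that mask keeps only the qualifying titles. A small sketch on invented data, with a threshold of 2 instead of 250 (the names `toy`, `counts`, `popular`, and `mean_by_title` are introduced here for illustration):

```python
import pandas as pd

# Toy ratings; all values are invented.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie A",
              "Movie B", "Movie C", "Movie C"],
    "rating": [5, 3, 4, 2, 4, 1],
})

# Number of ratings per title.
counts = toy.groupby("title").size()

# Boolean mask applied to the index keeps titles rated at least twice.
popular = counts.index[counts >= 2]

# Use the surviving titles to select rows from another per-title result.
mean_by_title = toy.groupby("title")["rating"].mean()
print(mean_by_title.loc[popular])
```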

1.4 Sorting the Data

To see the movies rated highest by female viewers, sort by the F column in descending order:

    top_female_ratings = mean_ratings.sort_values("F", ascending=False)
    top_female_ratings.head()

    gender                                                         F         M
    title
    Close Shave, A (1995)                                   4.644444  4.473795
    Wrong Trousers, The (1993)                              4.588235  4.478261
    Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)           4.572650  4.464589
    Wallace & Gromit: The Best of Aardman Animation (1996)  4.563107  4.385075
    Schindler's List (1993)                                 4.562602  4.491415
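When only the top few rows are needed, `DataFrame.nlargest` is an alternative to sorting the whole table and taking the head. The sketch below compares the two on an invented table (the names `toy_means`, `top_sorted`, and `top_nlargest` are introduced here for illustration); with no ties in the sort column, both should return the same rows in the same order:

```python
import pandas as pd

# Toy per-title means; all values are invented.
toy_means = pd.DataFrame(
    {"F": [4.6, 3.2, 4.1], "M": [4.4, 3.5, 3.9]},
    index=pd.Index(["Movie A", "Movie B", "Movie C"], name="title"),
)

# Full sort, then take the first two rows.
top_sorted = toy_means.sort_values("F", ascending=False).head(2)

# Ask directly for the two rows with the largest F values.
top_nlargest = toy_means.nlargest(2, "F")

print(top_nlargest)
print(top_sorted.equals(top_nlargest))
```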

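1.5 Measuring Rating Disagreement

Suppose we want to find the movies on which male and female viewers disagree the most. One approach is to add a column to mean_ratings holding the difference between the two means and then sort by it. Because diff is computed as M minus F, sorting in ascending order lists first the movies that women rated higher than men did: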

    mean_ratings["diff"] = mean_ratings["M"] - mean_ratings["F"]
    sorted_by_diff = mean_ratings.sort_values("diff")
    sorted_by_diff.head()

    gender                            F         M      diff
    title
    Dirty Dancing (1987)       3.790378  2.959596 -0.830782
    Jumpin' Jack Flash (1986)  3.254717  2.578358 -0.676359
    Grease (1978)              3.975265  3.367041 -0.608224
    Little Women (1994)        3.870588  3.321739 -0.548849
    Steel Magnolias (1989)     3.901734  3.365957 -0.535777

Reversing the order of the rows and taking the first five gives the movies that men preferred and women rated lower:

    sorted_by_diff[::-1].head()

    gender                                         F         M      diff
    title
    Good, The Bad and The Ugly, The (1966)  3.494949  4.221300  0.726351
    Kentucky Fried Movie, The (1977)        2.878788  3.555147  0.676359
    Dumb & Dumber (1994)                    2.697987  3.336595  0.638608
    Longest Day, The (1962)                 3.411765  4.031447  0.619682
    Cable Guy, The (1996)                   2.250000  2.863787  0.613787

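Finally, find the movies that drew the most disagreement among all viewers, regardless of gender. Here disagreement is measured by the standard deviation of the ratings: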

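    # Standard deviation of the ratings for each title
    rating_std_by_title = data.groupby("title")["rating"].std()
    # Restrict to the titles with at least 250 ratings
    rating_std_by_title = rating_std_by_title.loc[active_titles]
    rating_std_by_title.head()
    # The ten titles with the most spread-out ratings
    rating_std_by_title.sort_values(ascending=False)[:10]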
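To see why standard deviation works as a disagreement measure, consider a sketch with two invented movies: one that every viewer rates the same and one that splits viewers between the lowest and highest score. Note that pandas computes the sample standard deviation (`ddof=1`) by default. The names `toy` and `rating_std` are introduced here for illustration:

```python
import pandas as pd

# Two invented movies with the same number of ratings.
toy = pd.DataFrame({
    "title": ["Consensus"] * 4 + ["Polarizing"] * 4,
    "rating": [4, 4, 4, 4, 1, 5, 1, 5],
})

# Sample standard deviation (ddof=1) of the ratings for each title.
rating_std = toy.groupby("title")["rating"].std()

# The polarizing movie sorts first; the unanimous one has zero spread.
print(rating_std.sort_values(ascending=False))
```

Filtering to well-rated titles first, as the code above does with `active_titles`, matters because a standard deviation computed from only a handful of ratings is unreliable.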

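1.6 Counting Movie Genres

Each movie's genres are stored as a single pipe-separated string. Split that string into a list and store it in a new genre column, replacing genres, to make later aggregation easier: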

    movies["genres"].head()
    movies["genres"].head().str.split("|")
    movies["genre"] = movies.pop("genres").str.split("|")
    movies.head()

       movie_id  title                               genre
    0         1  Toy Story (1995)                    [Animation, Children's, Comedy]
    1         2  Jumanji (1995)                      [Adventure, Children's, Fantasy]
    2         3  Grumpier Old Men (1995)             [Comedy, Romance]
    3         4  Waiting to Exhale (1995)            [Comedy, Drama]
    4         5  Father of the Bride Part II (1995)  [Comedy]

Explode the genre column so that each movie gets one row per genre:

    movies_exploded = movies.explode("genre")
    movies_exploded[:10]

       movie_id  title                     genre
    0         1  Toy Story (1995)          Animation
    0         1  Toy Story (1995)          Children's
    0         1  Toy Story (1995)          Comedy
    1         2  Jumanji (1995)            Adventure
    1         2  Jumanji (1995)            Children's
    1         2  Jumanji (1995)            Fantasy
    2         3  Grumpier Old Men (1995)   Comedy
    2         3  Grumpier Old Men (1995)   Romance
    3         4  Waiting to Exhale (1995)  Comedy
    3         4  Waiting to Exhale (1995)  Drama
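The split-then-explode pattern can be tried on a tiny frame. The sketch below uses two of the movies from the output shown above; the names `toy_movies` and `exploded` are introduced here for illustration. Note that `explode` repeats the original index label for every row produced from the same movie, and that once the genres are in one row each, `value_counts` gives the number of movies per genre:

```python
import pandas as pd

# Two movies taken from the movies output shown above.
toy_movies = pd.DataFrame({
    "movie_id": [1, 3],
    "title": ["Toy Story (1995)", "Grumpier Old Men (1995)"],
    "genres": ["Animation|Children's|Comedy", "Comedy|Romance"],
})

# Split each pipe-separated string into a list of genres.
toy_movies["genre"] = toy_movies.pop("genres").str.split("|")

# One row per list element; the original index label is repeated.
exploded = toy_movies.explode("genre")
print(exploded)

# Number of movies listed under each genre.
print(exploded["genre"].value_counts())
```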

Now merge all three tables again and compute the average rating of each genre by age group:

    ratings_with_genre = pd.merge(pd.merge(movies_exploded, ratings), users)
    ratings_with_genre.iloc[0]
    genre_ratings = (ratings_with_genre.groupby(["genre", "age"])
                     ["rating"].mean()
                     .unstack("age"))
    genre_ratings[:10]

    age                 1        18        25        35        45        50        56
    genre
    Action       3.506385  3.447097  3.453358  3.538107  3.528543  3.611333  3.610709
    Adventure    3.449975  3.408525  3.443163  3.515291  3.528963  3.628163  3.649064
    Animation    3.476113  3.624014  3.701228  3.740545  3.734856  3.780020  3.756233
    Children's   3.241642  3.294257  3.426873  3.518423  3.527593  3.556555  3.621822
    Comedy       3.497491  3.460417  3.490385  3.561984  3.591789  3.646868  3.650949
    Crime        3.710170  3.668054  3.680321  3.733736  3.750661  3.810688  3.832549
    Documentary  3.730769  3.865865  3.946690  3.953747  3.966521  3.908108  3.961538
    Drama        3.794735  3.721930  3.726428  3.782512  3.784356  3.878415  3.933465
    Fantasy      3.317647  3.353778  3.452484  3.482301  3.532468  3.581570  3.532700
    Film-Noir    4.145455  3.997368  4.058725  4.064910  4.105376  4.175401  4.125932
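The column labels 1, 18, 25, and so on are codes for age brackets rather than exact ages. The sketch below shows one way to make such a table easier to read by renaming the columns. The bracket labels in `age_labels` reflect my understanding of the dataset's README and should be checked against it; the toy data and the names `toy` and `toy_genre_ratings` are invented for illustration:

```python
import pandas as pd

# Age-bracket codes as I understand them from the MovieLens 1M README;
# this mapping is an assumption and should be verified against the README.
age_labels = {1: "Under 18", 18: "18-24", 25: "25-34", 35: "35-44",
              45: "45-49", 50: "50-55", 56: "56+"}

# Toy exploded ratings; all values are invented.
toy = pd.DataFrame({
    "genre": ["Action", "Action", "Comedy", "Comedy", "Comedy"],
    "age": [1, 25, 1, 25, 25],
    "rating": [4, 3, 5, 2, 4],
})

# Mean rating per genre and age code, with age codes moved to the
# columns and then replaced by readable bracket labels.
toy_genre_ratings = (toy.groupby(["genre", "age"])["rating"]
                     .mean()
                     .unstack("age")
                     .rename(columns=age_labels))
print(toy_genre_ratings)
```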