
Big Data Practicum: Processing the Book-Crossing Dataset


Dataset source: Book-Crossing Dataset, http://www2.informatik.uni-freiburg.de/~cziegler/BX/

Data import:

import pandas_profiling
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import requests
from PIL import Image as im
from wordcloud import WordCloud, STOPWORDS

# Users
# names= replaces the column names; because header= is not set,
# the header line of each file is read in as an ordinary data row.
u_cols = ['user_id', 'location', 'age']
users = pd.read_csv('C:/Users/Desktop/推荐系统/第五次实验/BX-Users.csv', sep=';', names=u_cols, encoding='latin-1', low_memory=False)
# Books
i_cols = ['isbn', 'book_title', 'book_author', 'year_of_publication', 'publisher', 'img_s', 'img_m', 'img_l']
items = pd.read_csv('C:/Users/Desktop/推荐系统/第五次实验/BX-Books.csv', sep=';', names=i_cols, encoding='latin-1', low_memory=False)
# Ratings
r_cols = ['user_id', 'isbn', 'rating']
ratings = pd.read_csv('C:/Users/Desktop/推荐系统/第五次实验/BX-Book-Ratings.csv', sep=';', names=r_cols, encoding='latin-1', low_memory=False)

Look at the first few rows and the basic statistics:

users.head()

 

users.describe()

 

users.dtypes

 

ratings.describe()

ratings.dtypes

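The output above shows that the first row of each table is not real data (it is the header line of the CSV file, read in as a data row), so we drop it:

# Drop the first row of each table
users = users.drop(users.index[0])
items = items.drop(items.index[0])
ratings = ratings.drop(ratings.index[0])
users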
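
Dropping the row afterwards works, but by then every column has been read as text (object dtype), which is why the type conversion further down is needed. As a minimal alternative sketch, assuming the files really do begin with a header line as the output above suggests, `header=0` can be passed together with `names=` so that `read_csv` discards the header line itself. The inline CSV below is invented sample data in the same layout, not rows taken from the dataset, and `users_demo` is a hypothetical name.

```python
import io
import pandas as pd

# Invented sample in the same layout as BX-Users.csv (hypothetical rows).
sample_csv = io.StringIO(
    '"User-ID";"Location";"Age"\n'
    '"1";"nyc, new york, usa";NULL\n'
    '"2";"stockton, california, usa";"18"\n'
)

u_cols = ['user_id', 'location', 'age']
# header=0 marks line 0 as a header that `names` replaces,
# so it never appears as a data row.
users_demo = pd.read_csv(sample_csv, sep=';', names=u_cols, header=0)

print(users_demo)
print(users_demo.dtypes)
```

With this variant the unquoted `NULL` is recognised as a missing value, just as in the original read, and no row has to be dropped afterwards.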

 Count the missing values:

users.isnull().sum()

Type conversion:

# Convert the columns that were read as text into numeric types
users['age'] = users['age'].astype(float)
users['user_id'] = users['user_id'].astype(int)
ratings['user_id'] = ratings['user_id'].astype(int)
ratings['rating'] = ratings['rating'].astype(int)
items['year_of_publication'] = items['year_of_publication'].astype(int)

users.isnull().sum()
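
`astype` raises a `ValueError` as soon as a single value cannot be parsed. If a column turns out to contain stray text, one possible fallback (not what the code above does) is `pd.to_numeric` with `errors='coerce'`, which turns unparseable entries into `NaN` so they can be treated like any other missing value. The sample values below are invented for illustration.

```python
import pandas as pd

# Invented sample: two valid years and one stray text value.
raw_years = pd.Series(['2001', '1999', 'not a year'])

# errors='coerce' converts anything unparseable to NaN instead of raising.
years = pd.to_numeric(raw_years, errors='coerce')

print(years)
print('unparseable entries:', int(years.isna().sum()))
```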

Summary of the age column:

users['age'].describe()

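Next we clean up implausible ages. Values above 99 or below 5 are set to missing, and all missing ages are then filled with the mean age:

import numpy as np
# Mark implausible ages as missing
users.loc[(users.age>99) | (users.age<5),'age'] = np.nan
# Fill every missing age with the mean of the remaining ages
users.age = users.age.fillna(users.age.mean())

ratings.isnull().sum()
items.isnull().sum()  # check the books table for missing values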
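
To make the effect of this age cleaning concrete, here is a small worked example on invented values. It uses `Series.where` as an equivalent way of writing the same rule (keep ages from 5 to 99, blank out the rest); the numbers are made up and `cleaned` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Invented ages: 3 and 120 fall outside the 5..99 range used above.
ages = pd.Series([3.0, 20.0, 30.0, 120.0, np.nan])

# Step 1: keep ages inside the range, turn everything else into NaN.
cleaned = ages.where((ages >= 5) & (ages <= 99))
# Step 2: fill every missing age with the mean of the valid ages (20 and 30).
cleaned = cleaned.fillna(cleaned.mean())

print(cleaned.tolist())  # [25.0, 20.0, 30.0, 25.0, 25.0]
```

One side effect worth keeping in mind: every user with a missing or implausible age ends up at exactly the mean, which shows up as a spike in any later age histogram.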

Inspect the rows whose publisher is missing:

items.loc[items.publisher.isnull(),:]

 We looked these books up and filled in the publishers by hand:

# Fill the missing publishers with the values we looked up
items.loc[items.isbn=='193169656X','publisher']='Mundania Press LLC'
items.loc[items.isbn=='1931696993','publisher']='Novelbooks Incorporated'
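
When there are more than a handful of such corrections, one `loc` assignment per ISBN becomes tedious. A possible sketch, assuming the looked-up values are kept in a plain dictionary keyed by ISBN: `map` plus `fillna` applies them in one step and leaves existing publishers untouched. The table `items_demo`, the dictionary `publisher_fixes`, and the third row are invented for illustration.

```python
import pandas as pd

# Invented miniature of the items table; None marks a missing publisher.
items_demo = pd.DataFrame({
    'isbn': ['193169656X', '1931696993', '0000000000'],
    'publisher': [None, None, 'Example House'],
})

# Looked-up values keyed by ISBN (the two entries used in the text above).
publisher_fixes = {
    '193169656X': 'Mundania Press LLC',
    '1931696993': 'Novelbooks Incorporated',
}

# map() looks each ISBN up in the dict; fillna() only touches missing cells,
# so publishers that are already present are never overwritten.
items_demo['publisher'] = items_demo['publisher'].fillna(
    items_demo['isbn'].map(publisher_fixes)
)

print(items_demo)
```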

The same goes for missing authors:

items.loc[items.book_author.isnull(),:]

# Fill this one in as well
items.loc[items.isbn=='9627982032','book_author']='Larissa Anne Downe'

# Check whether the publication years look plausible
print(sorted(items['year_of_publication'].unique()))

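 It is 2021 at the time of writing, so no publication year should be later than 2021. A year of 0 is not valid either. Both cases are set to missing and filled with the rounded mean year:

# Mark implausible publication years as missing
items.loc[(items.year_of_publication==0)|(items.year_of_publication>2021) ,'year_of_publication' ] = np.nan
# Fill the gaps with the rounded mean of the remaining years
items.year_of_publication = items.year_of_publication.fillna(round(items.year_of_publication.mean()))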
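
Before overwriting values it can help to count how many entries the rule affects, and to cast the column back to integers afterwards, since a column holding `NaN` becomes float. A minimal sketch on invented years follows; `MAX_YEAR` is a hypothetical constant standing in for the 2021 cutoff.

```python
import pandas as pd

MAX_YEAR = 2021  # cutoff used in the text above

# Invented publication years: 0 and 2050 are implausible.
years = pd.Series([0, 1995, 2001, 2050, 2004], dtype=float)

implausible = (years == 0) | (years > MAX_YEAR)
print('implausible entries:', int(implausible.sum()))  # 2

# Blank out the implausible years, fill them with the rounded mean of
# the plausible ones (1995, 2001, 2004), then cast back to int.
years = years.mask(implausible)
years = years.fillna(round(years.mean())).astype(int)

print(years.tolist())  # [2000, 1995, 2001, 2000, 2004]
```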

Merge the three tables:

# Join users with their ratings, then attach the book details
df = pd.merge(users, ratings, on='user_id')
df = pd.merge(df, items, on='isbn')
df.head(5)
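
`pd.merge` performs an inner join by default, so any rating whose ISBN has no match in the books table is dropped without a message. The sketch below, on invented miniature tables, shows one way to check how many rows that affects by using `how='left'` with `indicator=True`; the names `ratings_demo`, `items_demo`, `checked` and `unmatched` are hypothetical.

```python
import pandas as pd

# Invented miniature tables.
ratings_demo = pd.DataFrame({
    'user_id': [1, 1, 2],
    'isbn': ['A', 'B', 'C'],
    'rating': [5, 0, 8],
})
items_demo = pd.DataFrame({
    'isbn': ['A', 'B'],
    'book_title': ['Title A', 'Title B'],
})

# Default how='inner': the rating for ISBN 'C' disappears silently.
inner = pd.merge(ratings_demo, items_demo, on='isbn')
print(len(inner))  # 2

# how='left' with indicator=True keeps every rating and adds a `_merge`
# column showing which rows found no matching book.
checked = pd.merge(ratings_demo, items_demo, on='isbn', how='left', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched[['user_id', 'isbn']])
```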

 Select the ratings made by users over 50 and by users under 25:

user_fit = df[df['age'] > 50]
user_fit

user_fit25 = df[df['age'] < 25]
user_fit25

Rank the titles in each group by number of ratings and show the top 10:

user_fit['book_title'].value_counts().head(10)

user_fit25['book_title'].value_counts().head(10)
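
`value_counts` ranks titles by how many ratings they received, which measures popularity rather than preference. In the Book-Crossing data a rating of 0 denotes an implicit interaction and 1 to 10 an explicit score, so a possible refinement is to look at explicit ratings only and report the count together with the mean. This is a sketch on invented data, not part of the original analysis; `df_demo` and `summary` are hypothetical names.

```python
import pandas as pd

# Invented miniature of the merged table.
df_demo = pd.DataFrame({
    'book_title': ['A', 'A', 'A', 'B', 'B', 'C'],
    'rating':     [0,   8,   6,   10,  0,   9],
})

# Keep explicit ratings only (0 marks an implicit interaction).
explicit = df_demo[df_demo['rating'] > 0]

# Number of explicit ratings and their mean, per title,
# sorted by count first and mean rating second.
summary = (
    explicit.groupby('book_title')['rating']
    .agg(n_ratings='count', mean_rating='mean')
    .sort_values(['n_ratings', 'mean_rating'], ascending=False)
)
print(summary)
```

The same aggregation could be applied to `user_fit` and `user_fit25` to compare the two age groups on both measures.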

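# Age distribution of the users behind the ratings
plt.figure(figsize=(10,8))
# sns.distplot is deprecated since seaborn 0.11; histplot draws the same kind of histogram
sns.histplot(df['age'], kde=False)
plt.xlabel('Age')
plt.ylabel('count')
plt.title('Age Distribution',size=20)
plt.show()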

# The 25 most common publication years among the rated books
df_v=df[['year_of_publication']].copy()
df_v['year_of_publication'] = df_v['year_of_publication'].astype(int).astype(str)
df_v=df_v['year_of_publication'].value_counts().head(25).reset_index()
df_v.columns=['year','count']
df_v['year']='Year '+df_v['year']
plt.figure(figsize=(10,8))
# The original passed palette=customPalette, which is never defined in this article;
# the built-in 'viridis' palette is used here as a stand-in.
sns.barplot(x='count',y='year',data=df_v,palette='viridis')
plt.ylabel('Year Of Publication')
plt.yticks(size=12)
plt.title('Years of Publication',size=20)
plt.show()
