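Scrape the Maoyan Top 100 movie board for movie titles, ratings, and related fields. Save the extracted data to a CSV file and visualize the movie ratings.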
import requests
import urllib.request
import gzip
from io import BytesIO
from lxml import etree
Implementation
Extracting the data with an object-oriented approach
class Maoyanspider:
    def __init__(self):
        # Headers that imitate a browser; intended for method two
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate, br',
        }
        # URL template; offset selects the page
        self.url = "https://maoyan.com/board/4?offset={}"
    # Method one: fetch the page with requests
    def parse_url(self, url):
        response = requests.get(url, headers=self.headers)
        return response.content.decode()
    # Method two: fetch the page with urllib and decompress the gzip body by hand
    def parse_url(self, url):
        req = urllib.request.Request(url, headers=self.headers)
        response = urllib.request.urlopen(req)
        html = response.read()
        buff = BytesIO(html)
        f = gzip.GzipFile(fileobj=buff)
        html = f.read().decode("utf-8")
        return html
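Method two only works when the response body really is gzip-compressed. The sketch below isolates the decompression step and adds a check of the two gzip magic bytes, so that an uncompressed body passes through unchanged. The helper name `decode_body` is hypothetical and not part of the original spider, the sample page is made up, and bodies compressed with Brotli (`br`) are not handled.

```python
import gzip
from io import BytesIO

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_body(raw: bytes, encoding: str = "utf-8") -> str:
    """Decompress raw if it is gzip data, then decode it to text.

    Hypothetical helper for illustration; not part of the original spider.
    """
    if raw[:2] == GZIP_MAGIC:
        # Same technique as method two: wrap the bytes in a file-like object
        with gzip.GzipFile(fileobj=BytesIO(raw)) as f:
            raw = f.read()
    return raw.decode(encoding)


# Simulate a compressed and an uncompressed response body (made-up content)
page = "<dd>示例电影</dd>"
compressed = gzip.compress(page.encode("utf-8"))
print(decode_body(compressed) == page)
print(decode_body(page.encode("utf-8")) == page)
```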
    def get_content_list(self, content_str, file):
        html = etree.HTML(content_str)
        # Group: one dd element per movie
        element_xpath = html.xpath("//*[@class='container']/div/div/div[1]/dl/dd")
        print(element_xpath)
        # Make sure xpath returned a non-empty list of elements
        for element in element_xpath:
            content_list = []
            title_first = element.xpath("./div/div/div[1]/p[1]/a/@title")
            content_list.append(title_first[0])
            text_list = element.xpath("./div/div/div[1]/p[3]/text()")
            content_list.append(text_list[0])
            # Join the text of every <i> node into a single string
            num_count = element.xpath("./div/div/div[2]/p/i/text()")
            str_i = ""
            for b in num_count:
                str_i += b
            content_list.append(str_i)
            print(content_list)
            self.write_csv(content_list, file)
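etree.HTML parses the fetched page into an html object that supports XPath queries. The code first groups the page into a list of dd element objects, then iterates over that list and extracts the movie fields from each dd element. The fields of each movie are collected in a list and written to the CSV file as one row.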
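The group-then-extract pattern described above can be tried offline on a small fragment. The following is only a sketch under stated assumptions: the markup is a simplified, hand-written imitation of board entries with made-up values, not Maoyan's real page, and it uses the standard library's `xml.etree.ElementTree` (which needs well-formed input and supports only a subset of XPath) in place of `lxml`, so that it runs without extra packages. The function name `extract_rows` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; class names, layout, and values are assumptions
SAMPLE = """
<dl>
  <dd>
    <p class="name"><a title="Sample Movie A">Sample Movie A</a></p>
    <p class="releasetime">2000-01-01</p>
    <p class="score"><i>9.</i><i>1</i></p>
  </dd>
  <dd>
    <p class="name"><a title="Sample Movie B">Sample Movie B</a></p>
    <p class="releasetime">2001-02-03</p>
    <p class="score"><i>8.</i><i>7</i></p>
  </dd>
</dl>
"""


def extract_rows(markup):
    """Return one [title, release date, rating] list per dd element."""
    root = ET.fromstring(markup)
    rows = []
    # Step 1: group, one element per movie
    for dd in root.findall("./dd"):
        link = dd.find("./p[@class='name']/a")
        release = dd.find("./p[@class='releasetime']")
        if link is None or release is None:
            continue  # skip entries that lack the expected fields
        # Step 2: join the text of every <i> node into one rating string
        score = "".join(i.text or "" for i in dd.findall("./p[@class='score']/i"))
        rows.append([link.get("title"), release.text, score])
    return rows


for row in extract_rows(SAMPLE):
    print(row)
```

Checking for `None` before reading a field avoids the `IndexError` that indexing an empty XPath result with `[0]` would raise.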
    def write_csv(self, content_list, file):
        file.write(",".join(content_list))
        file.write("\n")
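Joining the fields with `","` produces a broken row whenever a field itself contains a comma, which a title or an actor list may. One possible alternative, sketched here with the standard library's `csv` module, quotes such fields automatically. The sample record is made up, the helper name `write_row` is hypothetical, and an in-memory `io.StringIO` stands in for the real file.

```python
import csv
import io


def write_row(writer, content_list):
    """Write one movie record; csv.writer quotes fields that contain commas."""
    writer.writerow(content_list)


buffer = io.StringIO()  # stands in for the CSV file opened by the spider
writer = csv.writer(buffer)
writer.writerow(["电影", "上映时间", "评分"])
write_row(writer, ["Sample Movie, Part 2", "2000-01-01", "9.1"])  # made-up record

# Reading the text back recovers the original three fields
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows[1])
```

When writing to a real file with `csv.writer`, the `csv` documentation recommends opening it with `newline=""`.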
Program logic and execution
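    def run(self):
        # Create the CSV file and write the header row
        file = open("电影数据.csv", "w", encoding="utf-8")
        # Header matches the rows built in get_content_list: title, release date, rating
        csv_list = ["电影", "上映时间", "评分"]
        file.write(",".join(csv_list))
        file.write("\n")
        offset = 0
        # Request loop over the page URLs
        while offset < 100:
            # Build the URL with string formatting
            url = self.url.format(offset)
            # Send the request, extract the data, and save it to the CSV file
            content_str = self.parse_url(url)
            self.get_content_list(content_str, file)
            # Move on to the next page
            offset += 10
        file.close()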
if __name__ == '__main__':
    maoyan = Maoyanspider()
    maoyan.run()
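The `while` loop in `run` walks through the ten pages by raising `offset` in steps of 10. The same sequence of URLs can also be built with `range`, as in this small sketch; `page_urls` is a hypothetical helper name, and the URL template is the one the spider uses.

```python
URL_TEMPLATE = "https://maoyan.com/board/4?offset={}"


def page_urls(template=URL_TEMPLATE, total=100, page_size=10):
    """Return one URL per page: offsets 0, 10, ..., total - page_size."""
    return [template.format(offset) for offset in range(0, total, page_size)]


urls = page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```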
The resulting CSV data
import re
p = re.compile('<div class="movie-item-info">.*?title="(.*?)".*?class="star">(.*?)</p>.*?class="releasetime">(.*?)</p>.*?</div>',re.S)
content_list = p.findall(html)
Alternatively, the extraction function can use the re module to pull out the required fields.
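To see what the pattern captures, it can be run against a small fragment. The markup below is a hand-written imitation of one board entry with made-up values, not a copy of the real page. The captured groups keep any surrounding whitespace, so the sketch strips each field.

```python
import re

# Same pattern as above; re.S lets "." match newlines too
PATTERN = re.compile(
    '<div class="movie-item-info">.*?title="(.*?)".*?class="star">(.*?)</p>'
    '.*?class="releasetime">(.*?)</p>.*?</div>',
    re.S,
)

# Hand-written sample with made-up values
SAMPLE = """
<div class="movie-item-info">
  <p class="name"><a href="#" title="Sample Movie A">Sample Movie A</a></p>
  <p class="star">
      主演:Actor One,Actor Two
  </p>
  <p class="releasetime">上映时间:2000-01-01</p>
</div>
"""

# findall returns one tuple per match, with one item per capture group
records = [
    tuple(field.strip() for field in match)
    for match in PATTERN.findall(SAMPLE)
]
print(records)
```

Each tuple holds the title, the actor line, and the release line, in the order of the capture groups.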
import pandas as pd
from matplotlib import pyplot as plt
df = pd.read_csv("./电影数据3.csv")
print(df.head())
print(df.info())
_x = df["电影"].values[:20]
_y = df["评分"].values[:20]
plt.figure(figsize=(20,8),dpi=80)
plt.plot(range(len(_x)),_y)
plt.xticks(range(len(_x)),_x,rotation=45,fontproperties="SimHei")
for x, y in enumerate(_y):
    print(x, y)
    plt.text(x, y + 0.1, y, ha="center")
plt.show()
This plots a line chart comparing the ratings and uses the text method to label each point on the chart with its value.