
Scraping Douban's JSON Data with Python

Most people's first scraping projects target Douban in every way imaginable, but JSON data is a special case worth learning on its own. Job-listing sites serve their data as JSON too, and the same approach shown here works for them.

We'll crawl the content of this page, year by year:

选电影 (douban.com)

What this walkthrough actually requests is the JSON endpoint behind that page: https://m.douban.com/rexxar/api/v2/movie/recommend/filter_tags?selected_categories=%7B%7D

Open that URL directly in your browser and you will see raw JSON text rather than a rendered page (compare it with the JSON data of the Guazi used-car site, 二手瓜子网, to see the difference). Keep that distinction in mind before moving on: this endpoint hands us structured data directly.

1. The universal first step of any crawler: request headers

import csv
import os
import random
import time

import requests
from bs4 import BeautifulSoup


class Spider:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36 Edg/119.0.0.0',
        'Referer': 'https://movie.douban.com/explore'
    }
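With the headers in place, it is worth a quick sanity check that the endpoint answers with JSON instead of an anti-bot page. A minimal sketch, assuming the imports and Spider class above are already defined:

resp = requests.get(
    'https://m.douban.com/rexxar/api/v2/movie/recommend/filter_tags?selected_categories=%7B%7D',
    headers=Spider.headers)
print(resp.status_code)   # expect 200
data = resp.json()        # raises ValueError if the body is not JSON
print(list(data.keys()))  # the code below only relies on data['tags']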

2. Requesting the JSON data and iterating over the years

We iterate over the year tags; for each year, detail_parse scrapes the contents of every movie link, and save_csv writes the results to a file.

def index(self):
    url = 'https://m.douban.com/rexxar/api/v2/movie/recommend/filter_tags?selected_categories=%7B%7D'
    html = requests.get(url=url, headers=self.headers).json()
    tags = html.get('tags')[0].get('tags')
    for tag in tags[2:]:  # skip the first two entries, which are not concrete years
        movie_id = 0
        all_data = []
        for page in range(5):  # 5 pages of 20 movies per year
            parse_url = f'https://m.douban.com/rexxar/api/v2/movie/recommend?refresh=0&start={page * 20}&count=20&selected_categories=%7B%7D&uncollect=false&tags={tag}'
            tags_html = requests.get(url=parse_url, headers=self.headers).json()
            for item in tags_html['items']:
                movie_id += 1
                uri = item['uri'].split('douban.com/')[-1]
                detail_url = 'https://www.douban.com/doubanapp/dispatch?uri=' + uri
                self.detail_parse(detail_url, all_data, movie_id)
        self.save_csv(tag, all_data)
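Before trusting the tags[0]['tags'][2:] slice, print what the tag groups actually contain. The slice assumes the first group is the year group and that its first two entries are aggregate labels rather than concrete years, an assumption Douban could break by reordering at any time:

spider = Spider()
data = requests.get(
    'https://m.douban.com/rexxar/api/v2/movie/recommend/filter_tags?selected_categories=%7B%7D',
    headers=spider.headers).json()
for group in data.get('tags', []):
    print(group.get('tags'))  # one list of tag strings per filter group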

3. detail_parse: scraping each movie's details

We request each movie's URL, parse the response into a BeautifulSoup object, and extract the fields, cleaning the data along the way with the fill_null and pipei functions.

def detail_parse(self, detail_url, all_data, movie_id):
    self.random_sleep()
    html = requests.get(url=detail_url, headers=self.headers).content
    html = BeautifulSoup(html, 'lxml')
    movies_name = html.select_one('#content > h1 > span:nth-child(1)').text
    region_language = html.select_one('#info').get_text()
    region_language = region_language.split('\n')
    director, editor, actor, movie_type, region, language, on_time, duration = self.pipei(region_language[1:])
    year = on_time.split('-')[0]
    score = html.select_one('#interest_sectl > div.rating_wrap.clearbox > div.rating_self.clearfix > strong').text
    # the rating-count element can be missing, so guard against None
    element = html.select_one(
        '#interest_sectl > div.rating_wrap.clearbox > div.rating_self.clearfix > div > div.rating_sum > a > span')
    comments = element.text if element is not None else ''
    five_star, four_star, three_star, two_star, one_star = self.fill_null(
        html.select('#interest_sectl > div.rating_wrap.clearbox > div.ratings-on-weight > div'))
    all_data.append(
        [movie_id, movies_name, year, director, editor, actor, movie_type, region, language, on_time, duration,
         score, comments, five_star, four_star, three_star, two_star, one_star])
    return all_data
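When a selector breaks, it is easier to debug detail_parse against a single movie than in the middle of a full crawl. A sketch of such a smoke test; the uri value below is only illustrative, so substitute one taken from the recommend API's 'uri' field:

spider = Spider()
rows = spider.detail_parse(
    'https://www.douban.com/doubanapp/dispatch?uri=/movie/1292052/',  # illustrative uri
    all_data=[],
    movie_id=1)
print(rows[0])  # [movie_id, movies_name, year, director, editor, actor, ...]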

4. Handling missing values

Inspecting the scraped data shows that every movie should carry one- to five-star rating percentages, but some are missing, so we pad them out with empty values.

def fill_null(self, isNull):
    col = []
    try:
        for i in range(5):
            element = isNull[i].select_one('div > span.rating_per')
            col.append(element.text if element is not None else '')
        return col
    except IndexError:
        # fewer than five star rows on the page: fall back to five empty fields
        return ['', '', '', '', '']
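You can exercise both paths of fill_null with a hand-written fragment; the markup below is invented for illustration, and only the div > span.rating_per structure matters:

from bs4 import BeautifulSoup

fragment = BeautifulSoup(
    ''.join(f'<div class="item"><div><span class="rating_per">{p}%</span></div></div>'
            for p in (60.0, 20.0, 10.0, 6.0, 4.0)),
    'lxml')
print(Spider().fill_null(fragment.select('div.item')))
# -> ['60.0%', '20.0%', '10.0%', '6.0%', '4.0%']
# with fewer than five rows, the IndexError branch returns ['', '', '', '', '']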

5. Extracting the required fields

pipei maps each line of the #info block onto the fields we want, defaulting to an empty string for anything missing.

def pipei(self, columns):
    # field labels as they appear in Douban's #info block
    temp_col = ['导演', '编剧', '主演', '类型', '制片国家/地区', '语言', '上映日期', '片长']
    info = []
    # build a {label: value} dict from lines shaped like '导演: 某某'
    temp = dict(zip([i.split(':')[0].replace(' ', '') for i in columns],
                    [i.split(':')[-1].replace(' ', '') for i in columns]))
    for col in temp_col:
        info.append(temp.get(col, ''))  # missing fields default to ''
    return info
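A worked example with hand-written #info lines (the values are invented; only the 'label: value' shape matters):

sample = ['导演: 张三', '编剧: 李四', '主演: 王五', '类型: 剧情',
          '制片国家/地区: 中国大陆', '语言: 汉语普通话',
          '上映日期: 2023-01-01', '片长: 120分钟']
print(Spider().pipei(sample))
# -> ['张三', '李四', '王五', '剧情', '中国大陆', '汉语普通话', '2023-01-01', '120分钟']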

6. Saving to a CSV file

def save_csv(self, tags, all_data):
    with open(f'./FilmCrawl/{tags}.csv', 'w', newline='', encoding='utf-8') as fp:
        csv_write = csv.writer(fp)
        csv_write.writerow(
            ['movie_id', 'movies_name', 'year', 'director', 'editor', 'actor', 'movie_type', 'region', 'language',
             'on_time', 'duration', 'score', 'comments', 'five_star', 'four_star', 'three_star', 'two_star',
             'one_star'])
        for row in all_data:
            csv_write.writerow(row)
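One caveat: CSVs written as plain utf-8 can display garbled Chinese when opened directly in Excel; switching the encoding to utf-8-sig fixes that if it matters to you. A quick read-back check, where 2023.csv is a hypothetical file name standing in for whichever year tag you actually crawled:

with open('./FilmCrawl/2023.csv', encoding='utf-8') as fp:  # hypothetical file name
    print(next(csv.reader(fp)))  # the header row written by save_csv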

7. Random sleep

Sleep for a random interval before every request, so the traffic looks less like a bot.

def random_sleep(self):
    sleep_time = random.uniform(1, 5)  # pause 1-5 seconds
    time.sleep(sleep_time)

8. Putting it all together

if __name__ == '__main__':
    if not os.path.exists('./FilmCrawl'):
        os.mkdir('./FilmCrawl')
    spider = Spider()
    spider.index()
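If everything runs cleanly, ./FilmCrawl/ ends up holding one CSV per year tag; the exact file names depend on the tag strings the API returns. Expect the full run to take a while, since random_sleep pauses 1-5 seconds before every movie page.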
