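@[TOC]Baidu PaddlePaddle "Python: From Beginner to Expert" (Python小白逆袭大神) Camp: Closing Reflections
I was very happy to take part in Baidu PaddlePaddle's "Python: From Beginner to Expert" course. The course starts with Python and assumes absolutely no prior experience. The instructors explain everything clearly, moving from simple topics to advanced ones, and the course is well layered and clearly structured. I learned a lot, and my main takeaways can be summarized in the following points.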
import json
import re
import requests
import datetime
from bs4 import BeautifulSoup
import os

# Today's date as YYYYMMDD, used in the name of the JSON file
today = datetime.date.today().strftime('%Y%m%d')

def crawl_wiki_data():
    """
    Crawl the contestant information for "Youth With You 2" (《青春有你2》) from Baidu Baike and return the HTML
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    url = 'https://baike.baidu.com/item/青春有你第二季'

    try:
        response = requests.get(url, headers=headers)
        print(response.status_code)

        # Parse the response body into a BeautifulSoup document
        soup = BeautifulSoup(response.text, 'lxml')
        # Find every table whose class is "table-view log-set-param"
        tables = soup.find_all('table', {'class': 'table-view log-set-param'})

        # "参赛学员" ("contestants") is the heading of the table to return
        crawl_table_title = "参赛学员"

        for table in tables:
            # Look at the h3 headings in the div that precedes the table
            table_titles = table.find_previous('div').find_all('h3')
            for title in table_titles:
                if crawl_table_title in title:
                    return table
    except Exception as e:
        print(e)
②. Parse the crawled page data and save it as a JSON file
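The parse-and-save step can be sketched roughly as follows. This is a minimal illustration, not the course's code: it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra dependencies, and the names `TableRowParser`, `parse_contestants`, `save_json`, and the tiny sample table are all made up for the example. The only details taken from the post are the `name` and `link` keys, which `crawl_pic_urls` reads back from the JSON file.

```python
import json
import os
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the cell texts and the first link of every table row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one dict per <tr> that contains <td> cells
        self._cells = None      # cell texts of the row currently being parsed
        self._link = None       # first href seen inside the current row
        self._in_cell = False   # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells, self._link = [], None
        elif tag == "td" and self._cells is not None:
            self._in_cell = True
            self._cells.append("")
        elif tag == "a" and self._in_cell and self._link is None:
            self._link = dict(attrs).get("href")

    def handle_data(self, data):
        if self._in_cell:
            self._cells[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            self._in_cell = False
            if self._cells:  # skip header rows that only have <th> cells
                self.rows.append({"cells": self._cells, "link": self._link})
            self._cells = None


def parse_contestants(table_html, base_url="https://baike.baidu.com"):
    """Turn table HTML into a list of {"name", "link"} records (assumed layout)."""
    parser = TableRowParser()
    parser.feed(table_html)
    stars = []
    for row in parser.rows:
        link = row["link"] or ""
        if link.startswith("/"):      # make site-relative links absolute
            link = base_url + link
        # Assumption: the first column holds the contestant's name
        stars.append({"name": row["cells"][0], "link": link})
    return stars


def save_json(stars, path):
    """Write the records as UTF-8 JSON, creating the folder if needed."""
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(path, "w", encoding="UTF-8") as f:
        json.dump(stars, f, ensure_ascii=False)


# Invented sample input, only to show the shape of the data
sample = """
<table>
  <tr><th>Name</th><th>Hometown</th></tr>
  <tr><td><a href="/item/A">Contestant A</a></td><td>Beijing</td></tr>
  <tr><td>Contestant B</td><td>Shanghai</td></tr>
</table>
"""
stars = parse_contestants(sample)
print(json.dumps(stars, ensure_ascii=False))
```

In the post's workflow, the table returned by `crawl_wiki_data()` would be converted to a string with `str(table)` and passed to `parse_contestants`, and `save_json(stars, 'work/' + today + '.json')` would write the file that `crawl_pic_urls` later opens. The real table's column layout may differ from the one assumed here.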
def crawl_pic_urls():
    '''
    Crawl each contestant's Baidu Baike pictures and save them
    '''
    # Load the contestant records saved in the previous step
    with open('work/' + today + '.json', 'r', encoding='UTF-8') as file:
        json_array = json.loads(file.read())

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }

    for star in json_array:
        name = star['name']
        link = star['link']

        # Fetch the contestant's Baidu Baike page
        response = requests.get(link, headers=headers)
        bs = BeautifulSoup(response.text, 'lxml')
        # URL of the picture list linked from the summary picture
        pic_list_url = bs.select('.summary-pic a')[0].get('href')