Web crawlers (spiders) automatically collect information from the web. Python, with its rich ecosystem of third-party libraries and readable syntax, has become a popular choice for crawler development.
1. Setting up the Python environment
Install a Python 3.x release and make sure the environment is configured, i.e. the interpreter is available on your PATH.
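As a quick sanity check that the interpreter you are running is a 3.x release, a one-liner like the following works (the 3.6 floor is just a suggestion, chosen because the later examples use f-strings):
```python
import sys

# The examples in this tutorial use f-strings, which require Python 3.6 or newer.
assert sys.version_info >= (3, 6), "Python 3.6+ is recommended for these examples"
print(sys.version)
```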
2. Commonly used libraries
The libraries used in this tutorial are requests (sending HTTP requests), BeautifulSoup (parsing HTML), lxml (a fast parser backend for BeautifulSoup), and Scrapy (a full crawling framework).
3. Installing the libraries
Install each one with pip:
- pip install requests
- pip install beautifulsoup4
- pip install lxml
- pip install scrapy
1. HTTP requests
A crawler communicates with web servers over the HTTP protocol; the most commonly used request methods are GET and POST.
Example code:
```python
import requests

url = 'https://www.example.com'
response = requests.get(url)

print(response.status_code)  # print the HTTP status code
print(response.text)         # print the response body
```
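The example above issues a GET request. For completeness, here is a minimal POST sketch; it uses httpbin.org (a public request-echo service) and made-up form fields purely for illustration:
```python
import requests

# POST sends data to the server in the request body.
# httpbin.org simply echoes back what it receives; the form fields are hypothetical.
url = 'https://httpbin.org/post'
data = {'username': 'alice', 'password': 'secret'}

response = requests.post(url, data=data)
print(response.status_code)
print(response.json())  # httpbin returns the submitted form data as JSON
```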
2. HTML parsing
To extract information from a page, we need to parse its HTML. BeautifulSoup is an easy-to-use yet powerful HTML parsing library.
Example code:
```python
from bs4 import BeautifulSoup

html = '''
<html>
  <head>
    <title>Sample page</title>
  </head>
  <body>
    <h1>Welcome to the sample page</h1>
    <p>This is a paragraph.</p>
    <a href="https://www.example.com/page2">Link to page two</a>
  </body>
</html>
'''

soup = BeautifulSoup(html, 'lxml')

title = soup.title.string  # the <title> text
h1 = soup.h1.string        # the <h1> text
link = soup.a['href']      # the href attribute of the first <a> tag

print(title)
print(h1)
print(link)
```
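Beyond grabbing single tags, BeautifulSoup can collect every matching tag. A short sketch reusing the soup object from the example above (find_all returns all matches; select accepts CSS selectors):
```python
# Reusing the soup object built from the html string above.
for a in soup.find_all('a'):        # every <a> tag in the document
    print(a.get_text(), a['href'])

for p in soup.select('body p'):     # CSS selector: all <p> tags inside <body>
    print(p.get_text())
```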
3. Hands-on: crawling the Douban Movie Top 250
```python
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'
}


def get_movie_info(url):
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')

    movie_list = soup.find('ol', class_='grid_view')
    if movie_list is None:
        print("Movie list not found")
        return

    for movie in movie_list.find_all('li'):
        rank = movie.find('em').string
        title = movie.find('span', class_='title').string
        rating = movie.find('span', class_='rating_num').string
        link = movie.find('a')['href']

        print(f"Rank: {rank}")
        print(f"Title: {title}")
        print(f"Rating: {rating}")
        print(f"Link: {link}")
        print("-------")


def main():
    base_url = "https://movie.douban.com/top250?start="
    for i in range(0, 250, 25):
        url = base_url + str(i)
        print(f"Crawling page {i // 25 + 1}")
        get_movie_info(url)


if __name__ == '__main__':
    main()
```
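The script above only prints its results. To persist them, one option is the standard-library csv module. A minimal sketch, assuming you first collect the (rank, title, rating, link) values into a list of tuples instead of printing them; the save_movies helper below is illustrative, not part of the original script:
```python
import csv

def save_movies(movies, filename='douban_top250.csv'):
    """Write a list of (rank, title, rating, link) tuples to a CSV file."""
    with open(filename, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        writer.writerow(['rank', 'title', 'rating', 'link'])  # header row
        writer.writerows(movies)
```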
1. Exception handling
Requests and parsing can fail at runtime, so use Python's try-except statement to handle these exceptions.
Example code:
```python
import requests
from bs4 import BeautifulSoup


def get_page(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
    except requests.RequestException:
        print("Request failed")
        return None


def parse_page(html):
    try:
        soup = BeautifulSoup(html, 'lxml')
        title = soup.title.string
        print(title)
    except Exception as e:
        print(f"Parsing failed: {e}")


def main():
    url = "https://www.example.com"
    html = get_page(url)
    if html:
        parse_page(html)


if __name__ == '__main__':
    main()
```
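Many request failures are transient, so in practice it also helps to set a timeout and retry a few times before giving up. A rough sketch (the retry count, timeout, and pause are arbitrary choices):
```python
import time

import requests


def get_page_with_retry(url, retries=3, timeout=10):
    """Fetch a page, retrying on network errors with a short pause between attempts."""
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=timeout)
            if response.status_code == 200:
                return response.text
        except requests.RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            time.sleep(2)  # brief pause before trying again
    return None
```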
2. Multithreaded crawlers
When crawling a large amount of data, multithreading can improve the crawler's efficiency. Python's threading module can be used to implement a multithreaded crawler.
Example code:
```python
import requests
from bs4 import BeautifulSoup
import threading

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'
}


def get_movie_info(url):
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')

    movie_list = soup.find('ol', class_='grid_view')
    if movie_list is None:
        print("Movie list not found")
        return

    for movie in movie_list.find_all('li'):
        rank = movie.find('em').string
        title = movie.find('span', class_='title').string
        rating = movie.find('span', class_='rating_num').string
        link = movie.find('a')['href']

        print(f"Rank: {rank}")
        print(f"Title: {title}")
        print(f"Rating: {rating}")
        print(f"Link: {link}")
        print("-------")


def run(start):
    url = f"https://movie.douban.com/top250?start={start}"
    get_movie_info(url)


def main():
    threads = []
    for i in range(0, 250, 25):
        t = threading.Thread(target=run, args=(i,))
        threads.append(t)
        t.start()

    for t in threads:
        t.join()


if __name__ == '__main__':
    main()
```
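As an alternative to managing Thread objects by hand, the standard-library concurrent.futures module provides a thread pool. A sketch of the same crawl reusing the run() function from the example above (the pool size of 5 is an arbitrary choice):
```python
from concurrent.futures import ThreadPoolExecutor


def main():
    # max_workers limits how many pages are fetched concurrently.
    with ThreadPoolExecutor(max_workers=5) as executor:
        executor.map(run, range(0, 250, 25))  # run() is defined in the example above


if __name__ == '__main__':
    main()
```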
1. Creating a Scrapy project
First, create a Scrapy project with the following command:
scrapy startproject myspider
This creates a Scrapy project named myspider with the following structure:
```
myspider/
    scrapy.cfg
    myspider/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
```
2. Defining the Item
The items.py file defines the data structure to be scraped. In this example we scrape the Douban Movie Top 250. Edit items.py as follows:
```python
import scrapy


class DoubanMovieItem(scrapy.Item):
    rank = scrapy.Field()
    title = scrapy.Field()
    rating = scrapy.Field()
    link = scrapy.Field()
```
3. Writing the spider
Create a file named douban_spider.py under the spiders directory and write the spider code:
```python
import scrapy
from ..items import DoubanMovieItem


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['movie.douban.com']
    start_urls = [f'https://movie.douban.com/top250?start={i}' for i in range(0, 250, 25)]

    def parse(self, response):
        movie_list = response.css('ol.grid_view li')
        for movie in movie_list:
            item = DoubanMovieItem()
            item['rank'] = movie.css('em::text').get()
            item['title'] = movie.css('span.title::text').get()
            item['rating'] = movie.css('span.rating_num::text').get()
            item['link'] = movie.css('div.hd a::attr(href)').get()
            yield item
```
4. Data storage and anti-crawling measures
Scrapy can save the scraped data in several formats, such as JSON and CSV; in this example we save it as a JSON file. Setting a browser-like User-Agent and a download delay also makes the crawler less likely to be blocked. Add the following settings to settings.py:
```python
FEED_FORMAT = 'json'
FEED_URI = 'douban_top250.json'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0'
DOWNLOAD_DELAY = 3
```
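Note that FEED_FORMAT and FEED_URI are deprecated in newer Scrapy releases (2.1+) in favor of the FEEDS setting; if you are on a recent version, the equivalent configuration looks roughly like this:
```python
# settings.py, Scrapy 2.1+ style feed export
FEEDS = {
    'douban_top250.json': {'format': 'json', 'encoding': 'utf8'},
}
```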
5. Running the spider
From the project root directory, start the spider with the following command:
scrapy crawl douban
When the run finishes, a file named douban_top250.json containing the scraped data is created in the project root.
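Alternatively, you can leave the feed settings out and pass an output file directly on the command line, for example: scrapy crawl douban -o douban_top250.json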