Scrapy data parsing: crawling joke titles and contents with Scrapy
In the terminal, run:
1. scrapy startproject qiushiPro: create the crawler project folder
2. cd qiushiPro: enter the qiushiPro folder
3. scrapy genspider qiushi www.xxx.com: generate the spider script qiushi.py
4. Open qiushi.py and modify it as follows:
```python
import scrapy


class QiushiSpider(scrapy.Spider):
    name = "qiushi"
    # allowed_domains = ["www.xxx.com"]
    start_urls = ["https://www.qiushile.com/duanzi/"]

    def parse(self, response):
        # Parse out each joke's title and content
        li_list = response.xpath('//*[@id="ct"]/div[1]/div[2]/ul')
        for li in li_list:
            # xpath() returns a list (SelectorList), but every element in it
            # is a Selector object.
            # extract() pulls out the string stored in a Selector's data attribute.
            # title = li.xpath('./li/div[2]/div[1]/a/text()')[0].extract()
            title = li.xpath('./li/div[2]/div[1]/a/text()').extract_first()
            # Calling extract() on the whole list extracts the data string
            # from every Selector object in it.
            content = li.xpath('./li/div[2]/div[2]//text()')[0].extract()
            print(title, content)
            break
```
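The commented-out line and the extract_first() call differ in how they handle an XPath that matches nothing. A minimal stand-in (a hypothetical re-implementation for illustration, not Scrapy's actual code) shows the behavior:

```python
def extract_first(selector_list, default=None):
    """Mimics SelectorList.extract_first(): return the first extracted
    string, or a default value instead of raising IndexError."""
    return selector_list[0] if selector_list else default

# An XPath that matched nothing yields an empty list:
no_match = []
print(extract_first(no_match))           # None
print(extract_first(no_match, "n/a"))    # n/a

# A matching XPath yields the first extracted string:
print(extract_first(["title 1", "title 2"]))  # title 1

# By contrast, indexing the empty list with [0] would raise IndexError,
# which is why extract_first() is the safer choice in a spider.
```

This is why the tutorial prefers extract_first() over `[0].extract()`: a missing element degrades to None rather than crashing the parse callback.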
5. In the settings.py configuration file, change ROBOTSTXT_OBEY and add LOG_LEVEL and USER_AGENT:
# Only show log messages of the specified level (here, errors only)
LOG_LEVEL = "ERROR"
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36 Edg/116.0.1938.76"
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
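LOG_LEVEL = "ERROR" silences Scrapy's DEBUG/INFO chatter so only errors reach the console. Scrapy's log levels map onto Python's standard logging module, so the effect can be sketched with stdlib logging alone (the logger name and messages here are illustrative):

```python
import io
import logging

# Capture log output in a string buffer so we can inspect it
stream = io.StringIO()
handler = logging.StreamHandler(stream)

logger = logging.getLogger("qiushi_demo")
logger.addHandler(handler)
logger.propagate = False
logger.setLevel(logging.ERROR)  # the equivalent of LOG_LEVEL = "ERROR"

logger.info("crawled 10 pages")   # suppressed: below ERROR
logger.error("request failed")    # emitted: at ERROR level

print(stream.getvalue().strip())  # request failed
```

Only the ERROR record survives the filter, which is exactly what the setting does to Scrapy's own per-request INFO/DEBUG lines.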
6. In the terminal, run scrapy crawl qiushi to execute the spider.
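For reference, the folder created in step 1 follows Scrapy's standard project layout (annotations added; exact contents may vary slightly by Scrapy version):

```
qiushiPro/
├── scrapy.cfg            # deployment configuration
└── qiushiPro/
    ├── __init__.py
    ├── items.py          # item definitions
    ├── middlewares.py    # spider/downloader middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings (edited in step 5)
    └── spiders/          # spider code lives here
        ├── __init__.py
        └── qiushi.py     # generated in step 3, edited in step 4
```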