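The Scrapy Framework
Scrapy is a crawling framework for Python that combines data parsing, data processing, and data storage in a single package.
Installing Scrapy
Install the dependencies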
yum install gcc libffi-devel python-devel openssl-devel -y
yum install libxslt-devel -y
pip install scrapy
pip install twisted==13.1.0
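Note: Scrapy and Twisted have compatibility issues. If the installed Twisted version is too new for the Scrapy version in use, running scrapy startproject project_name fails with an error. In the setup used here, installing twisted==13.1.0 resolves it.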
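Crawl target: data about the hot topics (专题) on Jianshu. The site is https://www.jianshu.com/recommendations/collections, and the "Hot" (热门) tab is the page to crawl. The page loads its content asynchronously with AJAX. Press F12, open Network > XHR, and page through the list to find the request URL: https://www.jianshu.com/recommendations/collections?page=2&order_by=hot. Changing the number after page= gives access to the other pages.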
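The paging scheme described above can be sketched in a few lines of plain Python. This is only an illustration: the helper names page_url and page_urls are made up here, and the only facts taken from the article are the base URL and the page and order_by query parameters.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.jianshu.com/recommendations/collections"


def page_url(page, order_by="hot"):
    """Build the listing URL for one page (hypothetical helper)."""
    query = urlencode({"page": page, "order_by": order_by})
    return "{}?{}".format(BASE_URL, query)


def page_urls(first, last):
    """Return the URLs for pages first..last, inclusive."""
    return [page_url(page) for page in range(first, last + 1)]


if __name__ == "__main__":
    for url in page_urls(2, 4):
        print(url)
```

Building the query string with urlencode rather than string concatenation keeps the parameters correctly escaped if other values are added later.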
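The fields to collect for each topic are: topic name, topic description, number of articles, and number of followers. Scrapy uses XPath to extract the data. While writing the spider, you can first test the XPath expressions by hand with lxml, then move them into the Scrapy code once they return the right data. The one difference is that in Scrapy you must call extract() on the result to get the data out as strings.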
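To illustrate the idea of checking the paths by hand before wiring them into Scrapy, here is a minimal sketch. It makes two assumptions that are not from the article: the markup in SAMPLE is a simplified, hypothetical topic card (the real page differs), and the standard library's xml.etree.ElementTree is used as a stand-in for lxml so the snippet runs without extra packages. ElementTree supports only a subset of XPath (no text() step), so text is read through .text and .tail. With lxml, card.xpath('div/a/h4/text()') would return a list of strings directly, while in Scrapy the same expression returns selectors that need extract().

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup for one topic card; the real page differs.
SAMPLE = """<html><body>
  <div class="col-xs-8">
    <div>
      <a><h4> Stories </h4><p> Short fiction </p></a>
      <div><a>120篇文章</a> · 3400人关注</div>
    </div>
  </div>
</body></html>"""


def parse_cards(markup):
    """Extract one dict per topic card from well-formed markup."""
    root = ET.fromstring(markup)
    cards = []
    for card in root.findall('.//div[@class="col-xs-8"]'):
        count_link = card.find("div/div/a")
        cards.append({
            "name": card.findtext("div/a/h4").strip(),
            "description": card.findtext("div/a/p").strip(),
            "articles": count_link.text.strip().replace("篇文章", ""),
            # The follower text comes after the <a> element, so ElementTree
            # exposes it as that element's tail.
            "followers": count_link.tail.strip()
                                        .replace("人关注", "")
                                        .replace("· ", ""),
        })
    return cards


if __name__ == "__main__":
    for card in parse_cards(SAMPLE):
        print(card)
```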
##Code
import scrapy
from scrapy import Item
from scrapy import Field
class JianshuHotTopicItem(scrapy.Item):
    '''
    Subclass of scrapy.Item; declares the fields to be scraped.
    '''
    collection_name = Field()
    collection_description = Field()
    collection_article_count = Field()
    collection_attention_count = Field()
# -*- coding: utf-8 -*-
from scrapy.http import Request
from scrapy.spiders import Spider

from jianshu_hot_topic.items import JianshuHotTopicItem


# A plain Spider is enough here: no link-following rules are defined, and
# CrawlSpider reserves parse() for its own logic.
class jianshu_hot_topic(Spider):
    '''
    Crawl Jianshu topic data: extract specific fields from each listing page.
    '''
    name = "jianshu_hot_topic"
    start_urls = ["https://www.jianshu.com/recommendations/collections?page=2&order_by=hot"]
    # Throttle requests through Scrapy instead of time.sleep(), which would
    # block the event loop. With the default RANDOMIZE_DOWNLOAD_DELAY, the
    # actual wait varies between 0.5x and 1.5x of this value.
    custom_settings = {"DOWNLOAD_DELAY": 4}

    def parse(self, response):
        '''
        @params: response; extract the required fields from the response
        '''
        collections = response.xpath('//div[@class="col-xs-8"]')
        for collection in collections:
            # Create a fresh item per topic so yielded items do not share state.
            item = JianshuHotTopicItem()
            item['collection_name'] = collection.xpath('div/a/h4/text()').extract()[0].strip()
            item['collection_description'] = collection.xpath('div/a/p/text()').extract()[0].strip()
            item['collection_article_count'] = collection.xpath('div/div/a/text()').extract()[0].strip().replace('篇文章', '')
            item['collection_attention_count'] = collection.xpath('div/div/text()').extract()[0].strip().replace('人关注', '').replace('· ', '')
            yield item

        # Queue pages 3 to 10. Scrapy's duplicate filter drops URLs that have
        # already been requested, so repeating this on every page is harmless.
        urls = ['https://www.jianshu.com/recommendations/collections?page={}&order_by=hot'.format(i) for i in range(3, 11)]
        for url in urls:
            yield Request(url, callback=self.parse)
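The spider stores the two counts as strings after stripping the suffixes with replace(). If numeric values are wanted downstream, a small helper along these lines could do the conversion. This is a sketch under assumptions: to_int is a made-up name, and it assumes the counts appear as plain runs of digits, which the real page may not guarantee (for example, large numbers could be abbreviated).

```python
import re


def to_int(text):
    """Return the first run of digits in text as an int, or None if absent."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


if __name__ == "__main__":
    # Hypothetical sample strings in the shape the spider's replace() calls imply.
    print(to_int("120篇文章"))
    print(to_int("· 3400人关注"))
```

Returning None instead of raising keeps a single malformed card from aborting the whole crawl; the caller decides how to handle missing values.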
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import csv


class JianshuHotTopicPipeline(object):
    def process_item(self, item, spider):
        # Append one CSV row per item. open() replaces the Python 2-only
        # file() builtin, and the with block closes the file after each write.
        # The newline and encoding arguments require Python 3.
        with open('/root/zhuanti.csv', 'a', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow((item['collection_name'], item['collection_description'], item['collection_article_count'], item['collection_attention_count']))
        return item
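The pipeline above opens the CSV file again for every item. One possible variant, shown here only as a sketch, opens the file once when the spider starts and closes it when the spider finishes, using the open_spider and close_spider hooks that Scrapy calls on pipeline objects. The class name CsvExportPipeline and the default path are made up for illustration, and the code assumes Python 3.

```python
import csv


class CsvExportPipeline:
    """Hypothetical variant that keeps a single file handle for the whole crawl."""

    def __init__(self, path="zhuanti.csv"):
        self.path = path  # illustrative default, not the article's path

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "a", newline="", encoding="utf-8")
        self.writer = csv.writer(self.file)

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        self.writer.writerow([
            item["collection_name"],
            item["collection_description"],
            item["collection_article_count"],
            item["collection_attention_count"],
        ])
        return item
```

To use a class like this, it would be registered in ITEM_PIPELINES in place of JianshuHotTopicPipeline.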
ITEM_PIPELINES = {
    'jianshu_hot_topic.pipelines.JianshuHotTopicPipeline': 300,
}