I. Tools
Python 3.7, PyCharm, Scrapy
II. Implementation steps
1. Create the Scrapy project. Open a cmd console, cd to the folder where you want the project, and run scrapy startproject [ ], filling in the brackets with the project name you want.
Then run cd csdn (note: I named my project csdn) and, inside that folder, run scrapy genspider -t crawl csdn_url runoob.com, where csdn_url is the name of the spider's Python file and runoob.com is the domain of the runoob tutorial site. The crawl template can follow links that match a given rule, which is covered below.
The generated project contains the following:
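The screenshot from the original post did not survive; a freshly generated Scrapy project (with the csdn_url spider added by genspider) typically looks like this:

```
csdn/
    scrapy.cfg
    csdn/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            csdn_url.py
```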
Finally, so the spider can be run from inside PyCharm, create a start.py under the csdn directory with the following content:
```python
from scrapy import cmdline

# Equivalent to running "scrapy crawl csdn_url" in a terminal
cmdline.execute("scrapy crawl csdn_url".split())
```
2. Basic Scrapy configuration
First, make the following settings in settings.py.
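The settings screenshot from the original post is missing. These are the settings such a tutorial typically changes; the exact values are assumptions, not the author's original ones:

```python
# settings.py (sketch — assumed values, the original screenshot is lost)

# Ignore robots.txt so the tutorial pages can be fetched
ROBOTSTXT_OBEY = False

# Send a browser-like User-Agent with every request
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}

# Enable the pipeline defined later in pipelines.py
ITEM_PIPELINES = {
    "csdn.pipelines.CsdnPipeline": 300,
}

# Be polite to the site: wait between requests
DOWNLOAD_DELAY = 1
```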
Then analyze the runoob tutorial pages and write the crawl rule into csdn_url.py. The code is below: start_urls is the initial page link, and rules defines the link pattern to follow; every Python 3 tutorial link on runoob starts with the prefix https://www.runoob.com/python3/python3-.
```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from csdn.items import CsdnItem


class CsdnUrlSpider(CrawlSpider):
    name = 'csdn_url'
    allowed_domains = ['runoob.com']
    start_urls = ['https://www.runoob.com/python3/python3-tutorial.html']

    rules = (
        Rule(LinkExtractor(allow=r'https://www.runoob.com/python3/python3-+'),
             callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        # Page title; some pages append extra text in a <span> inside the <h1>
        name = response.xpath('//div[@class="article-intro"]/h1/text()').get() or ""
        span = response.xpath('//div[@class="article-intro"]/h1/span/text()').get()
        if span:
            name += span
        contents = response.xpath('//div[@class="article-intro"]//text()').getall()
        # Collect the h1/h2/h3 headings so they can be set off on their own lines
        title = [name]
        title += response.xpath('//div[@class="article-intro"]/h2/text()').getall()
        title += response.xpath('//div[@class="article-intro"]/h3/text()').getall()
        print("===============")
        print(name)
        print(title)
        content_list = []
        for i in contents:
            # Skip layout whitespace fragments
            if "\t" in i:
                continue
            if "\n" in i:
                continue
            if i in title:
                content_list.append("\n")  # blank line before a heading
            content_list.append(i.strip())
            if i in title:
                content_list.append("\n")  # and after it
        content = " ".join(content_list)
        print(content)
        item = CsdnItem(name=name, content=content)
        print(item)
        yield item
```
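The text-cleaning loop in parse_item can be tried on its own, without Scrapy. A minimal sketch run on hand-made sample strings (not real page content):

```python
def clean(contents, titles):
    """Drop fragments containing tabs/newlines; put headings on their own lines.

    Mirrors the filtering loop from parse_item above.
    """
    parts = []
    for fragment in contents:
        if "\t" in fragment or "\n" in fragment:
            continue  # layout whitespace from the page, not real text
        if fragment in titles:
            parts.append("\n")      # blank line before a heading
        parts.append(fragment.strip())
        if fragment in titles:
            parts.append("\n")      # and after it
    return " ".join(parts)

# Made-up sample fragments, mimicking what xpath('//text()') returns
sample = ["skip\tme", "Python3 教程", "Python is easy to learn."]
print(clean(sample, ["Python3 教程"]))
```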
Next, set up items.py; this example scrapes only the tutorial title and content:
```python
import scrapy


class CsdnItem(scrapy.Item):
    name = scrapy.Field()
    content = scrapy.Field()
```
Finally, configure the storage format and paths in pipelines.py, saving the items as both JSON and txt:
```python
from scrapy.exporters import JsonLinesItemExporter


class CsdnPipeline(object):
    def open_spider(self, spider):
        # Open both files once when the spider starts; opening them inside
        # process_item (as the original code did) would truncate cainiao.json
        # on every item, keeping only the last one.
        self.fp = open("cainiao.json", "wb")
        self.ft = open("cainiao.txt", "a", encoding="utf-8")
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding="utf-8")

    def process_item(self, item, spider):
        self.ft.write(str(item["name"]) + '\n')
        self.ft.write(str(item["content"]) + '\n\t')
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()
        self.ft.close()
```
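JsonLinesItemExporter writes one JSON object per line (the JSON Lines format). A stdlib-only sketch of the equivalent output, using made-up sample items:

```python
import json

# Hypothetical items, shaped like the CsdnItem fields above
items = [
    {"name": "Python3 教程", "content": "Python 3 教程内容"},
    {"name": "Python3 环境搭建", "content": "安装步骤"},
]

# One compact JSON object per line, non-ASCII kept readable,
# matching JsonLinesItemExporter(..., ensure_ascii=False)
lines = [json.dumps(it, ensure_ascii=False) for it in items]
jsonl = "\n".join(lines)
print(jsonl)
```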
3. Results
Original page (screenshot from the original post):
Scraped result (screenshot from the original post):