
A Beginner's Crawler: Scraping Every Novel on Biquge (笔趣阁) with Scrapy

I. Create the crawler:

cd into the folder where we keep our projects, then run:

scrapy startproject BQG
cd BQG
scrapy genspider biquge biduo.cc
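
After these three commands, Scrapy generates roughly the standard startproject layout below; biquge.py is the spider file we edit in step 2:

BQG/
├── scrapy.cfg
└── BQG/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── biquge.py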

II. Open the project directory in PyCharm

1. settings.py

Here I simply paste my whole settings file:

# -*- coding: utf-8 -*-
from fake_useragent import UserAgent
# Scrapy settings for BQG project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'BQG'

SPIDER_MODULES = ['BQG.spiders']
NEWSPIDER_MODULE = 'BQG.spiders'
LOG_LEVEL = 'WARNING'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = UserAgent().chrome

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
# }

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'BQG.middlewares.BqgSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'BQG.middlewares.BqgDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'BQG.pipelines.BqgPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
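
The only values changed from the defaults are LOG_LEVEL, USER_AGENT, ROBOTSTXT_OBEY, DOWNLOAD_DELAY and ITEM_PIPELINES. Note that USER_AGENT is produced by the third-party fake-useragent package, so it has to be installed before the crawl will start (assuming a pip environment):

pip install fake-useragent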

2. The spider file

A quick analysis of the spider:

(1) parse extracts links from the home page. We are after two kinds of target URLs: category page links such as https://www.biduo.cc/book_1_1/ and novel index (chapter list) page links such as https://www.biduo.cc/biquge/56_56606/. Category pages are fed back into parse itself, so each of them again yields both kinds of links; novel index pages are handed to the next callback, get_novel.
(2) get_novel only extracts the novel's title and the link to its first chapter (e.g. https://www.biduo.cc/biquge/56_56606/c100952.html). We deliberately do not collect every chapter link here: Scrapy issues requests concurrently, so requesting all chapters at once would return them out of order. Starting from that first chapter, get_page_content walks through the whole novel.
(3) get_page_content extracts each chapter's title and content, then follows the "next chapter" link to fetch the following page. Paging through the chapters one by one avoids the ordering problem.
# -*- coding: utf-8 -*-
import scrapy
import re


class BiqugeSpider(scrapy.Spider):
    name = 'biquge'
    allowed_domains = ['biduo.cc']
    start_urls = ['https://www.biduo.cc/']

    def parse(self, response):
        # Category pages look like /book_<category>_<page>/ ; feed them back
        # into parse so every page of every category gets crawled.
        pat = r'/book_\d+_\d+/'
        tab_lists = re.findall(pat, response.text)
        for li in tab_lists:
            yield scrapy.Request(
                url='https://www.biduo.cc' + li,
                callback=self.parse,
            )
        # Novel index pages look like /biquge/<id>_<id>/ ; hand them to get_novel.
        pat1 = r'/biquge/\d+_\d+/'
        t_lists = re.findall(pat1, response.text)
        for li in t_lists:
            yield scrapy.Request(
                url='https://www.biduo.cc' + li,
                callback=self.get_novel,
            )

    def get_novel(self, response):
        # Take only the novel title and the link to its FIRST chapter; the rest
        # of the chapters are reached one by one in get_page_content.
        novel_url = response.url
        novel_title = response.xpath('//div[@id="info"]/h1/text()').extract_first()
        novel_first = 'https://www.biduo.cc' + response.xpath(
            '//div[@id="list"]/dl/dd[1]/a/@href').extract_first()
        yield scrapy.Request(
            url=novel_first,
            callback=self.get_page_content,
            meta={'novel_title': novel_title, 'novel_url': novel_url}
        )

    def get_page_content(self, response):
        item = {}
        item['novel_title'] = response.meta['novel_title']
        item['novel_url'] = response.meta['novel_url']
        item['page_title'] = response.xpath('//h1/text()').extract_first()
        item['page_url'] = response.url
        item['page_content'] = ''.join(
            response.xpath('//div[@id="content"]/text()').extract()).replace('\xa0', '')
        yield item
        # The third link in div.bottem2 is "next chapter"; on the last chapter it
        # points back to the novel's index page, which is our signal to stop.
        next1 = response.xpath('//div[@class="bottem2"]/a[3]/@href').extract_first()
        print(next1)
        next_url = 'https://www.biduo.cc' + next1
        if next_url != item['novel_url']:
            yield scrapy.Request(
                url=next_url,
                callback=self.get_page_content,
                meta={'novel_title': item['novel_title'], 'novel_url': item['novel_url']}
            )
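
The spider yields plain dicts, which Scrapy accepts as items directly. If you prefer declared fields, a minimal items.py sketch could look like this (the field names simply mirror the dict keys used above; this file is optional and not part of the original code):

# items.py -- optional declared fields matching the keys the spider yields
import scrapy


class BqgItem(scrapy.Item):
    novel_title = scrapy.Field()
    novel_url = scrapy.Field()
    page_title = scrapy.Field()
    page_url = scrapy.Field()
    page_content = scrapy.Field()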

3. The pipeline file

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import os


class BqgPipeline(object):
    def open_spider(self, spider):
        # Create the output directory once, before the first chapter arrives.
        os.makedirs('novels', exist_ok=True)

    def process_item(self, item, spider):
        print(item['page_title'])
        # Append the chapter to its novel's text file; the with-block closes
        # the file handle again immediately.
        with open("novels/{}.txt".format(item['novel_title']), 'a+', encoding='utf-8') as f:
            f.write(item['page_title'] + '\n')
            f.write(item['page_url'] + '\n')
            f.write(item['page_content'] + '\n')
        return item
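
The pipeline above reopens the output file once per chapter, which is simple and safe. A slightly faster variant (a sketch, not part of the original post) caches one open handle per novel and closes them all in close_spider; keep in mind that a full-site crawl would then hold one handle open per novel crawled:

import os


class CachedFileBqgPipeline(object):
    # Hypothetical variant: keep one open file handle per novel.
    def open_spider(self, spider):
        os.makedirs('novels', exist_ok=True)
        self.files = {}

    def process_item(self, item, spider):
        title = item['novel_title']
        if title not in self.files:
            self.files[title] = open("novels/{}.txt".format(title), 'a+', encoding='utf-8')
        f = self.files[title]
        f.write(item['page_title'] + '\n')
        f.write(item['page_url'] + '\n')
        f.write(item['page_content'] + '\n')
        return item

    def close_spider(self, spider):
        for f in self.files.values():
            f.close()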

4. Running results:
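
From the project root (the directory containing scrapy.cfg), the crawl is started with the standard Scrapy command; biquge is the spider name given to genspider in step one:

scrapy crawl biquge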

(screenshot of the crawler's console output)

5. The full source code has been uploaded to my resources; download it there if you need it.
