
Crawling WeChat Mini Programs: Scraping WeChat Mini Program Pages

# -*- coding: utf-8 -*-

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class WxSpider(CrawlSpider):
    name = 'wx'
    allowed_domains = ['wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=1&page=1']

    rules = (
        # List pages: follow the pagination links; no callback is needed
        Rule(LinkExtractor(allow=r'http://www\.wxapp-union\.com/portal\.php\?mod=list&catid=1&page=\d+'), follow=True),
        # Detail pages: hand each article page to parse_item
        Rule(LinkExtractor(allow=r'http://www\.wxapp-union\.com/article-\d+-1\.html'), callback='parse_item'),
    )

    def parse_item(self, response):
        # Collect the article's title, author, and publication time
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        item['title'] = response.xpath('//*[@id="ct"]/div[1]/div/div[1]/div/div[2]/div[1]/h1/text()').get()
        item['author'] = response.xpath('//*[@id="ct"]/div[1]/div/div[1]/div/div[2]/div[3]/div[1]/p/a/text()').get()
        item['time'] = response.xpath('//*[@id="ct"]/div[1]/div/div[1]/div/div[2]/div[3]/div[1]/p/span/text()').get()
        return item
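
The two `allow` patterns in `rules` are ordinary regular expressions that the link extractor searches for in each discovered URL, so they can be checked on their own before running a crawl. The following is a minimal sketch using only the standard `re` module. It assumes the dots in the patterns are escaped so that they match literally. The `classify` helper and the sample article ID `1234` are hypothetical and are not part of the spider.

```python
import re

# Patterns mirroring the spider's rules, with literal dots escaped (an assumption
# of this sketch; an unescaped "." would match any character).
LIST_PAGE = re.compile(
    r'http://www\.wxapp-union\.com/portal\.php\?mod=list&catid=1&page=\d+')
DETAIL_PAGE = re.compile(
    r'http://www\.wxapp-union\.com/article-\d+-1\.html')


def classify(url):
    """Hypothetical helper: report which rule, if any, a URL would match."""
    if LIST_PAGE.search(url):
        return 'list'
    if DETAIL_PAGE.search(url):
        return 'detail'
    return None


if __name__ == '__main__':
    samples = [
        'http://www.wxapp-union.com/portal.php?mod=list&catid=1&page=2',
        'http://www.wxapp-union.com/article-1234-1.html',  # hypothetical ID
        'http://www.wxapp-union.com/article-1234-2.html',  # second page of an article
    ]
    for url in samples:
        print(url, '->', classify(url))
```

Under these assumptions, list URLs are only followed, first-page article URLs ending in `-1.html` go to `parse_item`, and later pages of the same article match neither rule.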