
Notes: crawling Gitee project names and URLs

A Gitee spider built with Scrapy

items:

import scrapy

class GiteeItem(scrapy.Item):
    link = scrapy.Field()
    desc = scrapy.Field()

db:

import emoji
import pymysql

connect = pymysql.connect(host='localhost', user='root', password='root',
                          db='mindsa', charset='utf8mb4')
cursor = connect.cursor()

def insertGitee(item):
    # Use a parameterized query rather than string formatting: it avoids
    # SQL injection and broken quoting when a description contains quotes.
    sql = "INSERT INTO gitee(link, `desc`) VALUES (%s, %s)"
    cursor.execute(sql, (emoji.demojize(item['link']),
                         emoji.demojize(item['desc'])))
    connect.commit()
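The INSERT above assumes a `gitee` table already exists. A minimal sketch of that schema as a DDL string (only the column names come from the code; the types, the `id` column, and the table options are assumptions):

```python
# Hypothetical DDL for the `gitee` table the INSERT targets.
# Column names match the code above; everything else is an assumption.
CREATE_GITEE = """
CREATE TABLE IF NOT EXISTS gitee (
    id INT AUTO_INCREMENT PRIMARY KEY,
    link VARCHAR(512) NOT NULL,
    `desc` TEXT
) CHARACTER SET utf8mb4
"""
```

`utf8mb4` matches the charset passed to `pymysql.connect`, so 4-byte characters survive the round trip even without `emoji.demojize`.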

pipelines:

class GiteePipeline:
    def process_item(self, item, spider):
        insertGitee(item)
        # Return the item so any later pipelines can keep processing it.
        return item

settings:

ITEM_PIPELINES = {
    'myscrapy.pipelines.GiteePipeline': 300,
}
GiteeSprider:
import scrapy
from myscrapy.items import GiteeItem

class GiteeSprider(scrapy.Spider):
    name = 'gitee'
    allowed_domains = ['gitee.com']
    start_urls = ['https://gitee.com/explore/all']

    def parse(self, response, **kwargs):
        # Locate the repository entries with an absolute XPath.
        elements = response.xpath('//div[@class="ui relaxed divided items explore-repo__list"]//div[@class="item"]')
        for element in elements:
            # Note: XPath on a sub-selector must be relative, i.e. prefix
            # // with a dot: use .// rather than //.
            link = response.urljoin(element.xpath('.//h3/a/@href').get())
            desc = element.xpath('.//div[@class="project-desc"]/text()').get()
            item = GiteeItem()
            item['link'] = link
            item['desc'] = desc
            yield item
        # Note: to match on several attributes in one XPath step,
        # join the conditions with "and".
        next_href = response.xpath(
            '//div[@class="ui tiny pagination menu"]//a[@class="icon item" and @rel="next"]/@href'
        ).get()
        if next_href is not None:
            # Follow the next page if there is one.
            yield scrapy.Request(response.urljoin(next_href), self.parse)
