
Storing Scraped Data in MySQL with the Scrapy Framework


1. Command prompt (cmd)

  scrapy startproject yaowen
  cd yaowen
  scrapy genspider yw www.gov.cn
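For orientation, these commands generate the standard Scrapy project layout; the files edited in the later steps live inside the inner yaowen package:

  yaowen/
      scrapy.cfg
      yaowen/
          __init__.py
          items.py
          middlewares.py
          pipelines.py
          settings.py
          spiders/
              __init__.py
              yw.py        # created by scrapy genspider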

2. items.py

  import scrapy

  class YaowenItem(scrapy.Item):
      title = scrapy.Field()    # headline text
      date = scrapy.Field()     # publication date
      url = scrapy.Field()      # absolute link to the article
      neirong = scrapy.Field()  # summary/body text

3. yw.py

  import scrapy
  import requests
  from urllib import parse
  from yaowen.items import YaowenItem


  class YwSpider(scrapy.Spider):
      name = 'yw'
      allowed_domains = ["www.gov.cn"]
      start_urls = ['http://www.gov.cn/xinwen/']

      def parse(self, response):
          # Each <dl> in the news list block holds one article entry.
          total = response.xpath('//div[@class="zl_channel_body zl_channel_bodyxw"]/dl')
          for b in total:
              item = YaowenItem()
              title1 = b.xpath('./dd/h4/a/text()').extract()
              date1 = b.xpath('./dd/h4/span/text()').extract()
              new_url1 = b.xpath('./dd/h4/a/@href').extract()
              neirong1 = b.xpath('./dd/p/text()').extract()

              # extract() returns lists of strings; join them into plain strings.
              new_url = "".join(new_url1)
              title = ' '.join(map(str, title1))
              date = ' '.join(map(str, date1))
              neirong = ' '.join(map(str, neirong1))

              # Build an absolute URL from the (possibly relative) href.
              # The base must include the scheme, host and path the links are
              # relative to; a bare 'http://gov.cn' produces wrong URLs.
              page_url = 'http://www.gov.cn/xinwen/'
              new_full_url = parse.urljoin(page_url, new_url)

              item['title'] = title
              item['date'] = date
              item['url'] = new_full_url
              item['neirong'] = neirong
              yield item

      def get_content(self, url):
          # Helper that fetches an article page directly with requests
          # (kept from the original code; not called in parse()).
          header = {
              "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
          }
          cont = requests.get(url, headers=header)
          content = cont.content.decode("gb2312", errors='ignore')
          return content
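If the MySQL table stays empty but the terminal shows no obvious error, it helps to first confirm that the spider is yielding items at all, independently of the database pipeline. Scrapy's feed export can dump the items to a file (the file name here is arbitrary):

  scrapy crawl yw -o items.json

If items.json comes out empty, the XPath selectors no longer match the page and the database code is not the problem; if it contains data, the issue lies in the pipeline or table setup covered below.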

4. pipelines.py (store into MySQL)

  from itemadapter import ItemAdapter
  import pymysql

  # Store items into MySQL
  class MysqlPipeline(object):
      def __init__(self):
          self.conn = pymysql.connect(host='localhost', user='root', password='zhangrui2580456',
                                      database='shiyanzuoye', port=3306, charset='utf8')
          self.cursor = self.conn.cursor()  # cursor object

      def process_item(self, item, spider):
          # Parameterized query, so quotes in the scraped text cannot break the INSERT.
          sql = 'INSERT INTO zuoyeTable(title, date, url, neirong) VALUES (%s, %s, %s, %s)'
          self.cursor.execute(sql, (item['title'], item['date'], item['url'], item['neirong']))
          self.conn.commit()
          return item  # pass the item on to any later pipeline

      def close_spider(self, spider):
          self.conn.close()
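The pipeline assumes the shiyanzuoye database and the zuoyeTable table already exist; a missing or mismatched table is a common reason rows never appear even though the spider itself runs through. A minimal table definition matching the four item fields (the column types are an assumption, not taken from the original post):

  CREATE TABLE IF NOT EXISTS zuoyeTable (
      id INT AUTO_INCREMENT PRIMARY KEY,
      title VARCHAR(255),
      date VARCHAR(64),
      url VARCHAR(512),
      neirong TEXT
  ) DEFAULT CHARSET = utf8;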

5. settings.py

  BOT_NAME = 'yaowen'

  SPIDER_MODULES = ['yaowen.spiders']
  NEWSPIDER_MODULE = 'yaowen.spiders'

  COOKIES_ENABLED = False

  ITEM_PIPELINES = {
      'yaowen.pipelines.MysqlPipeline': 300,
      # Keep this entry only if a MongodbPipeline class actually exists in pipelines.py;
      # otherwise remove it, since Scrapy cannot load a pipeline that is not defined.
      'yaowen.pipelines.MongodbPipeline': 400,
  }
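Note that the second entry points at a MongodbPipeline class that step 4's pipelines.py does not define. If you want to keep it, a minimal sketch is shown below, assuming pymongo is installed and MongoDB runs locally on the default port; the database and collection names are placeholders, not from the original post:

  import pymongo

  # Store items into MongoDB (sketch; names below are assumptions)
  class MongodbPipeline(object):
      def __init__(self):
          self.client = pymongo.MongoClient('localhost', 27017)
          self.collection = self.client['shiyanzuoye']['yaowen']

      def process_item(self, item, spider):
          self.collection.insert_one(dict(item))
          return item

      def close_spider(self, spider):
          self.client.close()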

6. start.py (launcher script; create a new start.py in the yaowen project directory)

  from scrapy import cmdline

  def main():
      # Equivalent to typing "scrapy crawl yw" on the command line.
      cmdline.execute(["scrapy", "crawl", "yw"])

  if __name__ == '__main__':
      main()
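Run the script from the directory that contains scrapy.cfg, for example:

  python start.py
  # or, equivalently, without the launcher:
  scrapy crawl yw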

7. Results

8. Storing the scraped data in MongoDB, and paginating through listing pages while crawling, will be covered in a follow-up post using 论文发表网 (a paper-publishing site) as the example.

 

 
