
[Python Web Scraping in Practice]: Scraping Second-Hand Housing Data



Preface

The World Wide Web holds countless pages and an immense amount of information on virtually every subject. Yet very often, whether for data analysis or for a product requirement, we need to pull the content we care about out of particular sites. Even in the twenty-first century a human still has only two hands and one pair of eyes, and cannot possibly open every page, read it, and copy-paste the results by hand. What we need is a program that fetches web pages automatically and extracts the relevant content according to rules we specify. That program is a web crawler.

I. What Is a Web Crawler?

Simply put, a crawler is a probing machine. Its basic job is to imitate human behavior: it wanders from site to site, hops from one link to the next, inspects the data it finds, and sends the information it sees back home, like a spider tirelessly crawling back and forth across the great web of the Internet.

II. Steps

1. Import the libraries

import requests
from pyquery import PyQuery as pq
from fake_useragent import UserAgent
import time
import random
import pandas as pd
import pymysql
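All of the third-party packages above are available from PyPI (time and random are part of the standard library), so a one-line install along these lines should cover the dependencies:

pip install requests pyquery fake-useragent pandas pymysql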

2. Scrape the data

1. Request headers

UA = UserAgent()  # random User-Agent generator
headers = {
    # request header fields go here; the full set used in this article appears in the complete code below
}
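fake_useragent picks a real browser User-Agent string at random, which makes the requests look less like they come from a script. Here is a minimal sketch of how it is typically used to fill the headers dict (the full header set for this article, including Cookie and Referer, is shown in the complete code below):

from fake_useragent import UserAgent

UA = UserAgent()
headers = {
    'User-Agent': UA.chrome,  # a randomly chosen Chrome User-Agent string
}
print(headers['User-Agent'])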

2. Making the request with requests.get()

requests.get() sends an HTTP GET request to the given URL and returns the server's response.
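A small self-contained example of fetching the first listing page (the timeout value and the status-code check are illustrative defensive touches, not part of the original code):

import requests
from fake_useragent import UserAgent

headers = {'User-Agent': UserAgent().chrome}
url = 'https://nn.lianjia.com/ershoufang/pg1/'
res = requests.get(url, headers=headers, timeout=10)  # HTTP GET to the listing page
if res.status_code == 200:
    print(res.text[:200])  # first 200 characters of the returned HTML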

3. Full code

import requests
from pyquery import PyQuery as pq
from fake_useragent import UserAgent
import time
import random
import pandas as pd
import pymysql

UA = UserAgent()
headers = {
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
    # cookie captured from a browser session on lianjia.com; replace with a fresh one if requests start failing
    'Cookie': 'lianjia_uuid=afd8b668-c5cd-4530-8e4e-3ef21570a8e1; _smt_uid=601176ce.984e553; UM_distinctid=177443816bf1a1-0c3cb4428f698d-13e3563-1fa400-177443816c020c; _ga=GA1.2.1832395402.1611757265; select_city=110000; lianjia_ssid=3f90d974-cc2f-4a53-a402-33ee1eb64b6b; Hm_lvt_9152f8221cb6243a53c83b956842be8a=1611757262,1614221173; Hm_lpvt_9152f8221cb6243a53c83b956842be8a=1614221173; CNZZDATA1253477573=1825509060-1611754241-%7C1614220268; CNZZDATA1254525948=1814375678-1611755575-%7C1614221158; CNZZDATA1255633284=201644086-1611755509-%7C1614216244; CNZZDATA1255604082=681395821-1611755268-%7C1614217306; srcid=eyJ0Ijoie1wiZGF0YVwiOlwiZWYwMzY0MDAzYTc3ZDgxYjUwNmVkMTA0ZGNlNTQzNjg0ZGM2NjdkYWRlZGFjY2Y5Zjc5ZWViZTQ5ZDFhN2I5MDM3N2FiMDc4ZjEwMDUzNjgwYzZhMjliMjQ4MzcxZDc3MjRhNGYwYzY4ZTNiYzI2OTE2Yjg2NTM0NDEyMDhiOTk4NjhhM2IwOTBiN2E0NjBiNDI4YWZhMDMwNjRjMzAxNWU2NTQyMDU4OGU2OTgzZTE0MmJjYTg2NmFmYmU4ZGRkNGFiNzA2YTE5ZjEwMmQ1NGQ5MTc1OTQxMzEyNzg2ZTM5M2Q3YjJiYThhMDhiYWI3YzBiZGE4NWNhZDdjOGMwNzFlOTljZmUzMGI3OGFkMTFkODM5N2VjZmRkNGUzNDllZjYzZjE0MGQ2OTYyNDZhYmJiNGM4ZjZmNjg3NTNjYjg2NGYzYWRmOWY4YjhhODY5ZTlhOGI4YWQzYmI1MTMyOVwiLFwia2V5X2lkXCI6XCIxXCIsXCJzaWduXCI6XCIxYWQ5OWVmZVwifSIsInIiOiJodHRwczovL2JqLmxpYW5qaWEuY29tL2Vyc2hvdWZhbmcvIiwib3MiOiJ3ZWIiLCJ2IjoiMC4xIn0=; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2217744381a3a4ba-062d85f2baa442-13e3563-2073600-17744381a3b518%22%2C%22%24device_id%22%3A%2217744381a3a4ba-062d85f2baa442-13e3563-2073600-17744381a3b518%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%7D%7D; _gid=GA1.2.1275843823.1614221176; _gat=1; _gat_past=1; _gat_global=1; _gat_new_global=1; _gat_dianpu_agent=1',
    'Host': 'nn.lianjia.com',
    'Referer': 'https://nn.lianjia.com/ershoufang/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36',
}

num_page = 2  # (unused in this script)


class Lianjia_Crawer:
    def __init__(self, txt_path):
        super(Lianjia_Crawer, self).__init__()
        self.file = str(txt_path)
        self.df = pd.DataFrame(columns=['title', 'community', 'citydirct', 'houseinfo',
                                        'dateinfo', 'taglist', 'totalprice', 'unitprice'])

    def run(self):
        '''Start the crawler script.'''
        for i in range(100):
            url = "https://nn.lianjia.com/ershoufang/pg{}/".format(str(i))
            self.parse_url(url)
            time.sleep(2)  # pause between pages to be gentle on the server
            self.df.to_csv(self.file, encoding='utf-8')  # save progress after every page
            print('Crawling url: {}'.format(url))
        print('Crawl finished!')

    def parse_url(self, url):
        headers['User-Agent'] = UA.chrome  # rotate to a random Chrome User-Agent
        res = requests.get(url, headers=headers)
        doc = pq(res.text)
        for i in doc('.clear.LOGCLICKDATA .info.clear'):
            try:
                pq_i = pq(i)
                title = pq_i('.title').text().replace('必看好房', '')
                Community = pq_i('.flood .positionInfo a').text()
                HouseInfo = pq_i('.address .houseInfo').text()
                DateInfo = pq_i('.followInfo').text()
                TagList = pq_i('.tag').text()
                TotalPrice = pq_i('.priceInfo .totalPrice').text().replace('万', '')
                UnitPrice = pq_i('.priceInfo .unitPrice').text().replace('元/平', '')
                CityDirct = str(Community).split(' ')[-1]
                Community = str(Community).split(' ')[0]
                data_dict = {
                    'title': title,
                    'community': Community,
                    'citydirct': CityDirct,
                    'houseinfo': HouseInfo,
                    'dateinfo': DateInfo,
                    'taglist': TagList,
                    'totalprice': TotalPrice,
                    'unitprice': UnitPrice
                }
                print(Community, CityDirct)
                # note: DataFrame.append was removed in pandas 2.0; on newer pandas use pd.concat instead
                self.df = self.df.append(data_dict, ignore_index=True)
                # self.file.write(','.join([title, Community, CityDirct, HouseInfo, DateInfo, TagList, TotalPrice, UnitPrice]))
                print([title, Community, CityDirct, HouseInfo, DateInfo, TagList, TotalPrice, UnitPrice])
                self.df.to_csv("E:/pythonProject/LianJia/aaaa/ershoufang_lianjia.csv", encoding='utf-8')
            except Exception as e:
                print(e)
                print("Failed to extract this listing; please retry.")


if __name__ == "__main__":
    txt_path = "E:/pythonProject/LianJia/aaaa/ershoufang_lianjiaa.csv"
    Crawer = Lianjia_Crawer(txt_path)
    Crawer.run()  # start the crawler
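One small caveat: range(100) starts at 0, so the first URL requested is https://nn.lianjia.com/ershoufang/pg0/. Lianjia's listing pages normally begin at pg1, so range(1, 101) is probably what was intended. Also note that the full DataFrame is rewritten to the CSV file after every page and every listing, so the output file always reflects everything scraped so far, even if the run is interrupted.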



Summary

The scraped data is not written into a database here; it is simply saved to a CSV file. For completeness, a sketch of how the same records could be stored in MySQL with pymysql is shown below.
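Since pymysql is imported at the top but never used, here is a minimal sketch of how each scraped listing could be inserted into MySQL instead. The connection parameters, database name (lianjia), and table name (ershoufang) are placeholders, and the table is assumed to already exist with matching columns:

import pymysql

# placeholder connection settings -- adjust to your own MySQL setup
conn = pymysql.connect(host='localhost', user='root', password='your_password',
                       database='lianjia', charset='utf8mb4')

def save_row(data_dict):
    """Insert one scraped listing (the data_dict built in parse_url) into MySQL."""
    sql = ("INSERT INTO ershoufang "
           "(title, community, citydirct, houseinfo, dateinfo, taglist, totalprice, unitprice) "
           "VALUES (%s, %s, %s, %s, %s, %s, %s, %s)")
    with conn.cursor() as cursor:
        cursor.execute(sql, (data_dict['title'], data_dict['community'],
                             data_dict['citydirct'], data_dict['houseinfo'],
                             data_dict['dateinfo'], data_dict['taglist'],
                             data_dict['totalprice'], data_dict['unitprice']))
    conn.commit()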
