
[Python Crawler Series] Scraping Second-Hand Housing Data from SouFun (搜房网) with Python


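This article briefly shows how to use Python to scrape second-hand housing listings from SouFun (搜房网) and save them to a MySQL database for later analysis and use.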

Many third-party libraries and frameworks are available for writing crawlers in Python. This article mainly uses requests, BeautifulSoup4, and MySQLdb.
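To make the "save to a database" step concrete, here is a minimal sketch of storing parsed listings through a DB-API connection. It is not the article's implementation: `sqlite3` from the standard library is used only as a stand-in so the example runs without a MySQL server, and the table name `house_listing`, its columns, and the sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical records, shaped the way a page parser might produce them.
listings = [
    {'city': 'Shanghai', 'title': 'Example listing A', 'total_price': 350.0},
    {'city': 'Shanghai', 'title': 'Example listing B', 'total_price': 520.0},
]


def save_listings(conn, rows):
    """Insert listing dicts into a (hypothetical) house_listing table."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS house_listing ('
        'city TEXT, title TEXT, total_price REAL)'
    )
    # Parameterized placeholders keep scraped text from being read as SQL.
    conn.executemany(
        'INSERT INTO house_listing (city, title, total_price) '
        'VALUES (:city, :title, :total_price)',
        rows,
    )
    conn.commit()


# An in-memory database stands in for the MySQL server (assumption).
conn = sqlite3.connect(':memory:')
save_listings(conn, listings)
count = conn.execute('SELECT COUNT(*) FROM house_listing').fetchone()[0]
print(count)
```

With MySQLdb the overall shape is similar (connect, execute with parameters, commit), but the placeholder style is `%s` rather than sqlite3's named `:name` form.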

Without further ado, here is the code. The explanations are given in the code comments.


Python code:

# -*- coding:utf-8 -*-
############################################################################
'''
# Program: Shanghai SouFun crawler
# Purpose: scrape for-sale and sold second-hand housing data from SouFun Shanghai
# Created: 2017/01/03
# Change history: 2017/01/07 added multi-city handling and random headers;
#                 added scraping of city URL information; wrapped in a class;
#                 added comments and logging
#
# Libraries: requests, BeautifulSoup4, MySQLdb
# Author: yuzhucu
'''
#############################################################################
import requests
from bs4 import BeautifulSoup
import lxml  # not called directly; available as a BeautifulSoup parser backend
import time
import random
import MySQLdb


def randHeader():
    '''
    Build request headers with a randomly chosen User-Agent.
    :return: dict of HTTP request headers
    '''
    head_connection = ['Keep-Alive', 'close']
    head_accept = ['text/html, application/xhtml+xml, */*']
    head_accept_language = ['zh-CN,fr-FR;q=0.5', 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3']
    head_user_agent = ['Opera/8.0 (Macintosh; PPC Mac OS X; U; en)',
                       'Opera/9.27 (Windows NT 5.2; U; zh-cn)',
                       'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Win64; x64; Trident/4.0)',
                       'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)',
                       'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET4.0C; .NET4.0E)',
                       'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET4.0C; .NET4.0E; QQBrowser/7.3.9825.400)',
                       'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; BIDUBrowser 2.x)',
                       'Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070309 Firefox/2.0.0.3',
                       'Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070803 Firefox/1.5.0.12',
                       'Mozilla/5.0 (Windows; U; Windows NT 5.2) Gecko/2008070208 Firefox/3.0.1',
                       'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.12) Gecko/20080219 Firefox/2.0.0.12 Navigator/9.0.0.6',
                       'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36',
                       'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; rv:11.0) like Gecko)',
                       'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0',
                       'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Maxthon/4.0.6.2000 Chrome/26.0.1410.43 Safari/537.1',
                       'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.92 Safari/537.1 LBBROWSER',
                       'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36',
                       'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/3.0 Safari/536.11',
                       'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
                       'Mozilla/5.0 (Macintosh; PPC Mac OS X; U; en) Opera 8.0'
                       ]
    # Only the User-Agent is randomized; the other headers use fixed entries.
    result = {
        'Connection': head_connection[0],
        'Accept': head_accept[0],
        'Accept-Language': head_accept_language[1],
        'User-Agent': random.choice(head_user_agent)
    }
    return result


def getCurrentTime():
    # Return the current local time as a log prefix, e.g. [2017-01-03 12:00:00]
    return time.strftime('[%Y-%m-%d %H:%M:%S]', time.localtime())

# The source listing breaks off here; the rest of the crawler is not shown.
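The two helpers in the listing can be exercised on their own. The following is a small sketch, not the author's code: `USER_AGENTS`, `pick_headers`, and `log` are hypothetical names, and the User-Agent pool is shortened to three of the article's entries for brevity.

```python
import random
import time

# Shortened pool (assumption); the article's list is longer.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Opera/9.27 (Windows NT 5.2; U; zh-cn)',
]


def pick_headers(rng=random):
    """Return request headers with a randomly chosen User-Agent."""
    return {
        'Connection': 'Keep-Alive',
        'Accept': 'text/html, application/xhtml+xml, */*',
        'Accept-Language': 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3',
        'User-Agent': rng.choice(USER_AGENTS),
    }


def log(message):
    """Print a message prefixed with a timestamp in the article's format."""
    stamp = time.strftime('[%Y-%m-%d %H:%M:%S]', time.localtime())
    line = '{} {}'.format(stamp, message)
    print(line)
    return line


headers = pick_headers()
log('Using User-Agent: ' + headers['User-Agent'])
```

With requests, a dict like this would typically be passed per request, for example `requests.get(url, headers=pick_headers(), timeout=10)`, so that each page fetch can present a different User-Agent.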