
Scraping Zhihu and Jianshu Content with Python: Building a Dataset from Zhihu and Jianshu

1. Scraping Popular Zhihu Content

# -*- coding: utf-8 -*-
# Python 2 script: crawls 20 pages of a Zhihu collection and saves the text.
import urllib2
import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2 only)
import sys

reload(sys)
sys.setdefaultencoding('utf8')  # allow writing UTF-8 text through a plain file object

f = open('howtoTucao2.txt', 'w')  # open the output file
for pagenum in range(1, 21):
    strpagenum = str(pagenum)
    print "Getting data for Page " + strpagenum  # show progress in the shell
    url = "http://www.zhihu.com/collection/27109279?page=" + strpagenum
    page = urllib2.urlopen(url)  # fetch the web page
    soup = BeautifulSoup(page)   # parse it with BeautifulSoup
    # Question titles carry class 'zm-item-title'; answer bodies carry 'content hidden'.
    ALL = soup.findAll(attrs={'class': ['zm-item-title', 'content hidden']})
    for each in ALL:
        if each.name == 'h2':  # question title: the text sits inside an <a> tag
            nowstring = re.sub('<s.+>\n<a.+>\n<.+>\n', '', each.a.string)
        else:                  # answer body
            nowstring = re.sub('<s.+>\n<a.+>\n<.+>\n', '', each.string)
        nowstring = re.sub('<br>', '\n', nowstring)       # line breaks to newlines
        nowstring = re.sub('<\w+>', '', nowstring)        # strip opening tags
        nowstring = re.sub('</\w+>', '', nowstring)       # strip closing tags
        nowstring = re.sub('<.+>', '\n图片\n', nowstring)  # remaining tags (e.g. images) -> '图片' (image) placeholder
        nowstring = re.sub('&quot;', '"', nowstring)      # unescape HTML-encoded quotes
        print nowstring
        if nowstring:
            f.write(nowstring)
        else:
            f.write("\n No Answer \n")
f.close()  # close the output file
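
The script above is Python 2 code (urllib2, the print statement, BeautifulSoup 3) and will not run under Python 3. For reference, a rough Python 3 sketch of the same crawl using requests and bs4 follows. The URL and the two CSS classes are carried over from the original; everything else (the User-Agent header, the get_text-based cleanup in place of the regex chain) is an assumption, and today's Zhihu generally requires login and renders answers via JavaScript, so treat this as illustrative rather than a working scraper.

# -*- coding: utf-8 -*-
# Python 3 sketch of the same crawl, assuming the collection page is still static HTML.
import requests
from bs4 import BeautifulSoup

with open('howtoTucao2.txt', 'w', encoding='utf-8') as f:
    for pagenum in range(1, 21):
        print("Getting data for Page %d" % pagenum)
        url = "http://www.zhihu.com/collection/27109279?page=%d" % pagenum
        resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
        soup = BeautifulSoup(resp.text, 'html.parser')
        # Same two classes as the original: question titles and answer bodies.
        for each in soup.find_all(attrs={'class': ['zm-item-title', 'content hidden']}):
            text = each.get_text('\n', strip=True)  # replaces the regex-based tag stripping
            if text:
                f.write(text + '\n')
            else:
                f.write('\n No Answer \n')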


2. Scraping Jianshu Content (with the Scrapy Framework)

(1) item.py

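The original listing for item.py is truncated here. As a placeholder, a minimal sketch of what the item definition for a Jianshu article spider might look like follows; the class name and the three fields are assumptions, not the original author's code (Scrapy's generated module is conventionally named items.py).

# items.py -- a minimal sketch; the field names below are assumptions,
# since the original listing was lost.
import scrapy

class JianshuItem(scrapy.Item):
    title = scrapy.Field()    # article title
    author = scrapy.Field()   # author nickname
    content = scrapy.Field()  # article body text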