This project is implemented with Scrapy on Python 2.7.
The original plan was to build a proxy pool with Tor + Scrapy, but that also requires getting over the Great Firewall, which is too much trouble, so I switched to simply rotating the User-Agent instead. It turns out Zhihu doesn't ban the IP, so you can crawl with confidence. I also planned to scrape the authors and the comments under each answer, but later decided it wasn't necessary since it's just the same process repeated.
Target URL: https://www.zhihu.com/question/26049726
This is a dynamically loaded page, so we have to deal with JavaScript and AJAX: every time you scroll down, a "查看更多回答" (view more answers) button appears, and the full set of answers only shows up after clicking it again and again.
I use Selenium + PhantomJS to simulate those clicks until the whole page has been loaded; the core loop is sketched below, and the full spider follows.
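A minimal standalone sketch of that click-until-gone loop (it mirrors the PhantomJS driver and the fixed 3-second sleep used in the full spider; an explicit WebDriverWait would be a more robust alternative to sleeping):

# -*- coding: utf-8 -*-
import time
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get('https://www.zhihu.com/question/26049726')
buttons = driver.find_elements_by_xpath(u'//button[text()="查看更多回答"]')
while buttons:  # once the button disappears, every answer has been loaded
    for btn in buttons:
        btn.click()
        time.sleep(3)  # give the AJAX-loaded answers time to render
    buttons = driver.find_elements_by_xpath(u'//button[text()="查看更多回答"]')
driver.quit()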
Result preview:
Straight to the code:
ZhiHu.py
# encoding:utf-8
import sys
import re
import time

import scrapy
from scrapy.spiders import CrawlSpider
from bs4 import BeautifulSoup
from selenium import webdriver

from SpderSpy.items import SpderspyItem

reload(sys)
sys.setdefaultencoding('utf8')


class ZH(CrawlSpider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']  # domain names only, not full URLs

    def start_requests(self):
        coo = {}
        cookie = 'your cookie'  # copy your own cookie out of the browser
        for seg in cookie.split(';'):  # turn the cookie string into a dict
            key, value = seg.split('=', 1)
            coo[key] = value
        # Simulated login. The page can actually be crawled without logging in;
        # the cookie was added just in case it was needed.
        return [scrapy.FormRequest('https://www.zhihu.com/question/26049726',
                                   cookies=coo, callback=self.parse)]

    def parse(self, response):
        print "spider beginning"
        words = open('clould.txt', 'w+')  # file the scraped answers are written to
        url = response.url
        driver = webdriver.PhantomJS()
        driver.get(url)
        # locate the "查看更多回答" (view more answers) button
        clk = driver.find_elements_by_xpath('//button[text()="查看更多回答"]')
        count = 0
        while clk:  # once the button is gone, every answer has been loaded
            for i in clk:
                i.click()  # simulate the click
                time.sleep(3)  # wait 3s so the newly loaded content has time to render
            clk = driver.find_elements_by_xpath('//button[text()="查看更多回答"]')
        item = SpderspyItem()
        soup = BeautifulSoup(driver.page_source, 'html.parser', from_encoding='utf-8')
        item['content'] = soup.find_all('span', class_="RichText ztext CopyrightRichText-richText")
        for i in item['content']:
            print unicode.encode(i.get_text(), 'utf-8')  # convert unicode to utf-8 before printing/writing
            print '\n\n'
            words.write(unicode.encode(i.get_text(), 'utf-8'))
            count = count + 1
        print 'Total: %d answers' % count
        words.close()
        driver.quit()
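The spider imports SpderspyItem from SpderSpy.items, but items.py isn't shown in the post. For item['content'] to work, the item class needs a content field; a minimal sketch, assuming the Scrapy project is named SpderSpy:

# SpderSpy/items.py
import scrapy

class SpderspyItem(scrapy.Item):
    content = scrapy.Field()  # holds the answer <span> nodes collected in parse()

With that in place, the spider can be run from the project root with scrapy crawl zhihu.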
settings.py (lines that need to be uncommented):
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
COOKIES_ENABLED = True
middlewares.py
import random


class SpderspySpiderMiddleware(object):

    def process_request(self, request, spider):
        user_agent_random = random.choice(self.useragent)
        request.headers.setdefault('User-Agent', user_agent_random)  # pick a random User-Agent for every request

    useragent = [
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.91 Safari/537.36',
        'Mozilla/5.0 (Linux; U; Android 1.5; de-de; Galaxy Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1',
        'Mozilla/5.0 (Linux; U; Android 2.2; en-gb; GT-P1000 Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
        'Opera/9.80 (Android 2.3.3; Linux; Opera Mobi/ADR-1111101157; U; es-ES) Presto/2.9.201 Version/11.50',
        'Opera/9.80 (Android 3.2.1; Linux; Opera Tablet/ADR-1109081720; U; ja) Presto/2.8.149 Version/11.10',
        'BlackBerry7100i/4.1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 VendorID/103',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; FujitsuToshibaMobileCommun; IS12T; KDDI)',
        'Opera/9.80 (J2ME/MIDP; Opera Mini/5.1.22296; BlackBerry9800; U; AppleWebKit/23.370; U; en) Presto/2.5.25 Version/10.54',
    ]
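One step the post leaves implicit: a process_request hook only takes effect once the class is registered as a downloader middleware. A minimal settings.py sketch, assuming the Scrapy project is named SpderSpy and the class above lives in SpderSpy/middlewares.py (the priority value 543 is an arbitrary choice):

DOWNLOADER_MIDDLEWARES = {
    'SpderSpy.middlewares.SpderspySpiderMiddleware': 543,
    # disable Scrapy's built-in User-Agent middleware so it doesn't set the header first
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}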
After the crawl, a clould.txt file is generated. Next we process its contents to build the word cloud we want.
Building a word cloud from English text is relatively easy because the words are already separated by spaces. Chinese is different: the words run together, so the text has to be segmented explicitly. For that we use the jieba module (pip install jieba), which automatically splits sentences into words. We also need a Chinese font file, simsun.ttf; it must sit in the same directory as the clould.txt file above (the notebook's working directory), otherwise generating the cloud will fail.
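A quick Python 2 sketch of what jieba's segmentation looks like (the sample sentence is the one used in jieba's own documentation):

# -*- coding: utf-8 -*-
import jieba

print " ".join(jieba.cut(u"我来到北京清华大学"))
# prints: 我 来到 北京 清华大学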
Next, set up the tool for generating the word cloud: Jupyter Notebook. Install it with pip install jupyter.
Once installed, run jupyter notebook in a terminal and it opens in the browser:
Choose New in the top-right corner to create a notebook, then enter the following cell:
%pylab inline
import jieba
import matplotlib.pyplot as plt
from wordcloud import WordCloud
filename = 'clould.txt'
with open(filename) as f:
    mytext = f.read()
mytext = " ".join(jieba.cut(mytext))
wordcloud = WordCloud(font_path="simsun.ttf").generate(mytext)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
Click Run at the top to render the word cloud inline.
Click the image to save it to your machine.
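Alternatively, the wordcloud library can write the image straight to disk with its to_file method (the filename here is just an example):

wordcloud.to_file("zhihu_wordcloud.png")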
Code on GitHub: https://github.com/quking/ZhihuCrawl (remember to star it)
The code can also be downloaded from CSDN code.
That's all.