A quick personal note first: it's been about four months since the last post, and I've been busy with other things. Besides a new project going live at work and learning some new tech, I also started dating. Yes, the aiming-for-marriage kind, which promotes me straight to the top of the programmer contempt chain. Emmmmm, all I can say is: come at me!
OK, OK, this is a technical post. Recently work needed the review data for a few convenience-store chains on Dianping. It's a one-off requirement, no need to wrap it up as a service, so I reached for something I'd learned before: Python's requests + BeautifulSoup to fetch the pages and parse out the information.
The chains: Wuhan's 7tt, Today (今天), and so on.
First, take a look at these two URLs:
https://www.dianping.com/search/keyword/16/0_7tt
https://www.dianping.com/search/keyword/16/0_today今天
These are the list pages for two of the chains; both follow a fixed format with the store name appended at the end.
Step one is to get each shop's id and build the shop's detail URL from it, e.g. http://www.dianping.com/shop/22711693
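The full code at the end simply hardcodes the shop ids. If you wanted to collect them from the search list pages instead, here's a minimal sketch; it assumes each search result contains an anchor pointing at a /shop/<id> URL (true of Dianping's markup at the time, but worth re-checking), and in practice you'd reuse the headers/cookies/proxies described further down:

import re
import requests
from bs4 import BeautifulSoup

listUrl = "https://www.dianping.com/search/keyword/16/0_7tt"
r = requests.get(listUrl, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.text, 'lxml')
shopIds = set()
for a in soup.find_all('a', href=True):
    m = re.search(r'/shop/(\d+)', a['href'])   # pull the id out of any shop link
    if m:
        shopIds.add(m.group(1))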
Click the "更多点评" (more reviews) link at the bottom and you land on the page with all the reviews.
So the final review-page URL is http://www.dianping.com/shop/22711693/review_all
Clicking a page number at the bottom changes the URL: /p2, for example, gets appended for page 2:
http://www.dianping.com/shop/22711693/review_all/p2
So by reading the page numbers at the bottom you can walk through every review page.
Now, how do you get that page count?
Open the dev tools: F12 on Windows, alt+command+J on Mac.
You can see there are 9 elements with class=PageLink in total, so just add 1 when looping. The code:
url = "https://www.dianping.com/shop/%s/review_all" % i
r = requests.get(url, headers=headers, cookies=cookies, proxies=proxies)
soup = BeautifulSoup(r.text, 'lxml')
# number of PageLink elements + 1 = total number of pages
lenth = len(soup.find_all(class_='PageLink')) + 1
The lenth you end up with is the page count for this shop.
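With lenth in hand you can loop over the review pages; the snippets below assume a soupIn built per page, along the lines of the loop in the full code at the end (note the pages are 1-indexed):

for j in xrange(1, lenth + 1):
    urlIn = "http://www.dianping.com/shop/%s/review_all/p%s" % (i, j)
    re = requests.get(urlIn, headers=headers, cookies=cookies, proxies=proxies)
    soupIn = BeautifulSoup(re.text, 'lxml')
    title = soupIn.title.string[0:15]   # shop name, taken from the page title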
Next up: how to pull each review's username, star rating, and review text out of a page.
As the markup shows, the reviews sit in a series of li elements, so grab the li list first, then read each field from inside the li.
coment = soupIn.select('.reviews-items li')
Then iterate over the li elements:
for one in coment:
    # skip the li elements whose class is "item": they are not reviews
    try:
        if one['class'][0] == 'item':
            continue
    except KeyError:
        pass
    name = one.select_one('.main-review .dper-info .name')
    name = name.get_text().strip()
    # star rating is encoded in the span's class, e.g. sml-str40 -> 4 stars
    star = one.select_one('.main-review .review-rank span')
    star = star['class'][1][7:8]
    # full review text is collapsed behind class="Hide"; overwrite the class list
    pl = one.select_one('.main-review .review-words')
    pl['class'] = {'review-words'}
    words = pl.get_text().strip()
    returnList.append([title, name, star, words])
Because this selects every li under class="reviews-items", and stepping through with breakpoints showed the ones whose class is "item" aren't reviews, they just need to be skipped, hence the check.
The username name is easy to get. The star rating star is encoded in a span's class: class="sml-str40" means 4 stars, so you read the class attribute and slice out the digit.
The review text, the part that matters most, is collapsed behind an expand-review button that toggles class="Hide", so first strip the Hide class from the review div by simply overwriting it: pl['class'] = {'review-words'}
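One caveat about the [7:8] slice: it keeps a single character, so a hypothetical half-star class like sml-str45 would come out as "4". A slightly more defensive parse, as a sketch (parse_star is my own helper, not from the original code):

import re

def parse_star(cls):
    # "sml-str40" -> 40 -> 4.0 stars; None when the class has no trailing digits
    m = re.search(r'(\d+)$', cls)
    return int(m.group(1)) / 10.0 if m else None

parse_star('sml-str40')  # 4.0
parse_star('sml-str45')  # 4.5, where the [7:8] slice would give just "4"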
And that's basically it: collect everything into a list, then write it to a file or a database.
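For the file route, here's a minimal CSV-writing sketch (the filename and column names are my own; 'wb' mode and the explicit UTF-8 encoding are what Python 2's csv module expects):

import csv

f = open('reviews.csv', 'wb')
w = csv.writer(f)
w.writerow(['title', 'name', 'star', 'words'])   # column order matches returnList
for row in returnList:
    w.writerow([unicode(c).encode('utf-8') for c in row])
f.close()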
Requests have to carry headers and cookies to get through. The cookies identify the visitor; some of their fields are parsed server-side and include a timestamp, so they expire after a while. The Referer in headers tells the site which page you jumped from; leave it out and after a few requests you'll be blocked from going further, on suspicion of being a crawler.
Also, too many requests from the same IP will get the IP banned, which is where proxies come in. In Python that's easy, just pass a proxies argument with the request: r = requests.get(url, headers=headers, cookies=cookies, proxies=proxies). As for proxy IPs, http://www.data5u.com/ is worth a look; there are 20 free ones at the bottom of the page, generally enough for a small crawler. Once you use proxies you'll run into questions like whether the proxy connection is even alive, so add the code below to the program to configure retries and connection handling:
requests.adapters.DEFAULT_RETRIES = 5  # let urllib3 retry failed connections up to 5 times
s = requests.session()
s.keep_alive = False  # commonly set to discourage connection reuse through flaky proxies
That's roughly the whole thing. The full code is attached below.
Feel free to follow me on Weibo @住街对面的查理. My life is pretty interesting, want to come take a look?
#coding=utf-8
from bs4 import BeautifulSoup
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import json
import requests

# shop ids collected from the search list pages
list = [22711693,24759450,69761921,69761921,22743334,66125712,22743270,57496584,75153221,57641884,66061653,70669333,57279088,24740739,66126129,
        75100027,92667587,92452007,72345827,90004047,90485109,90546031,83527455,91070982,83527745,94273474,80246564,83497073,69027373,96191554,
        96683472,90500524,92454863,92272204,70443082,96076068,91656438,75633029,96571687,97659144,69253863,98279207,90435377,70669359,96403354,
        83618952,81265224,77365611,74592526,90479676,56540304,37924067,27496773,56540319,32571869,43611843,58612870,22743340,67293664,67292945,
        57641749,75157068,58934198,75156610,59081304,75156647,75156702,67293838,]
returnList = []
proxies = {
    # "https": "http://14.215.177.73:80",
    "http": "http://202.108.2.42:80",
}
headers = {
    'Host': 'www.dianping.com',
    'Referer': 'http://www.dianping.com/shop/22711693',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/535.19',
    'Accept-Encoding': 'gzip'
}
cookies = {
    '_lxsdk_cuid': '16146a366a7c8-08cd0a57dad51b-32637402-fa000-16146a366a7c8',
    'lxsdk': '16146a366a7c8-08cd0a57dad51b-32637402-fa000-16146a366a7c8',
    '_hc.v': 'ec20d90c-0104-0677-bf24-391bdf00e2d4.1517308569',
    's_ViewType': '10',
    'cy': '16',
    'cye': 'wuhan',
    '_lx_utm': 'utm_source%3DBaidu%26utm_medium%3Dorganic',
    '_lxsdk_s': '1614abc132e-f84-b9c-2bc%7C%7C34'
}
requests.adapters.DEFAULT_RETRIES = 5
s = requests.session()
s.keep_alive = False

for i in list:
    url = "https://www.dianping.com/shop/%s/review_all" % i
    r = requests.get(url, headers=headers, cookies=cookies, proxies=proxies)
    soup = BeautifulSoup(r.text, 'lxml')
    # number of PageLink elements + 1 = total number of pages
    lenth = len(soup.find_all(class_='PageLink')) + 1
    for j in xrange(1, lenth + 1):   # pages are 1-indexed: p1 .. p<lenth>
        urlIn = "http://www.dianping.com/shop/%s/review_all/p%s" % (i, j)
        re = requests.get(urlIn, headers=headers, cookies=cookies, proxies=proxies)
        soupIn = BeautifulSoup(re.text, 'lxml')
        title = soupIn.title.string[0:15]   # shop name, taken from the page title
        coment = soupIn.select('.reviews-items li')
        for one in coment:
            # skip the li elements whose class is "item": they are not reviews
            try:
                if one['class'][0] == 'item':
                    continue
            except KeyError:
                pass
            name = one.select_one('.main-review .dper-info .name')
            name = name.get_text().strip()
            # star rating is encoded in the span's class, e.g. sml-str40 -> 4 stars
            star = one.select_one('.main-review .review-rank span')
            star = star['class'][1][7:8]
            # review text is collapsed behind class="Hide"; overwrite the class list
            pl = one.select_one('.main-review .review-words')
            pl['class'] = {'review-words'}
            words = pl.get_text().strip()
            returnList.append([title, name, star, words])

file = open("/Users/huojian/Desktop/store_shop.sql", "w")
for one in returnList:
    file.write("\n")
    file.write(unicode(one[0]))
    file.write("\n")
    file.write(unicode(one[1]))
    file.write("\n")
    file.write(unicode(one[2]))
    file.write("\n")
    file.write(unicode(one[3]))
    file.write("\n")
file.close()