Python 3.8
Selenium (make sure the matching ChromeDriver for your Chrome version is installed)
MongoDB
MongoDB Compass
Google Chrome
Open the Taobao homepage at https://www.taobao.com/, as shown below:
Taking the product "ipad" as an example, type ipad into the search box and click Search, as shown below:
Copy the links of the first four result pages and look for a pattern:
Page 1: https://uland.taobao.com/sem/tbsearch?refpid=mm_26632258_3504122_32538762&keyword=ipad&clk1=131aad029d6faa56c03288f51979aa45&upsId=131aad029d6faa56c03288f51979aa45&spm=a2e0b.20350158.search.1&pid=mm_26632258_3504122_32538762&union_lens=recoveryid%3A201_11.170.87.38_13712137_1628510033830%3Bprepvid%3A201_11.136.53.174_13740411_1628511461979
Page 2: https://uland.taobao.com/sem/tbsearch?refpid=mm_26632258_3504122_32538762&keyword=ipad&clk1=131aad029d6faa56c03288f51979aa45&upsId=131aad029d6faa56c03288f51979aa45&spm=a2e0b.20350158.31919782.1&pid=mm_26632258_3504122_32538762&union_lens=recoveryid%3A201_11.170.87.38_13712137_1628510033830%3Bprepvid%3A201_11.136.53.174_13740411_1628511461979&pnum=1
Page 3: https://uland.taobao.com/sem/tbsearch?refpid=mm_26632258_3504122_32538762&keyword=ipad&clk1=131aad029d6faa56c03288f51979aa45&upsId=131aad029d6faa56c03288f51979aa45&spm=a2e0b.20350158.31919782.1&pid=mm_26632258_3504122_32538762&union_lens=recoveryid%3A201_11.170.87.38_13712137_1628510033830%3Bprepvid%3A201_11.136.53.174_13740411_1628511461979&pnum=2
Page 4: https://uland.taobao.com/sem/tbsearch?refpid=mm_26632258_3504122_32538762&keyword=ipad&clk1=131aad029d6faa56c03288f51979aa45&upsId=131aad029d6faa56c03288f51979aa45&spm=a2e0b.20350158.31919782.1&pid=mm_26632258_3504122_32538762&union_lens=recoveryid%3A201_11.170.87.38_13712137_1628510033830%3Bprepvid%3A201_11.136.53.174_13740411_1628511461979&pnum=3
…
Clearly, apart from the slightly special first-page link, the remaining links are essentially identical; the only difference is the value of the trailing pnum parameter, which is 1, 2, 3 respectively. You can also find that setting the value to 0 opens the first page of data! Based on this pattern we can construct the links for the first ten pages:
base_url = 'https://uland.taobao.com/sem/tbsearch?refpid=mm_26632258_3504122_32538762&keyword=ipad&clk1=131aad029d6faa56c03288f51979aa45&upsId=131aad029d6faa56c03288f51979aa45&spm=a2e0b.20350158.31919782.1&pid=mm_26632258_3504122_32538762&union_lens=recoveryid%3A201_11.170.87.38_13712137_1628510033830%3Bprepvid%3A201_11.136.53.174_13740411_1628511461979&pnum={}'
url_list = [base_url.format(i) for i in range(10)]
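Equivalently, the pnum value can be set with the standard library's urllib.parse instead of string formatting, which keeps working even if the parameter is not the last one in the query string. A minimal sketch; the shortened query string below is an assumption for readability, the real link carries many more parameters:

```python
from urllib.parse import urlencode, urlparse, parse_qsl, urlunparse

def page_url(base, page):
    """Return `base` with its pnum query parameter set to `page`."""
    parts = urlparse(base)
    query = dict(parse_qsl(parts.query))
    query['pnum'] = str(page)
    return urlunparse(parts._replace(query=urlencode(query)))

# a shortened stand-in for the real search link (assumption)
base = 'https://uland.taobao.com/sem/tbsearch?keyword=ipad&pnum=0'
urls = [page_url(base, i) for i in range(10)]
```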
Fetching the page source directly with requests does not return all of the data, so Selenium is used here instead to obtain the fully rendered page source. The code is as follows:
def get_html(url):
    # launch Chrome, load the page, and return the rendered source
    browser = webdriver.Chrome()
    browser.get(url)
    response = browser.page_source
    browser.close()
    return response
The data can be parsed with the lxml module; since XPath expressions can be copied straight from Chrome's developer tools, this approach is quite simple. Here I simply extracted four fields: product title, shop name, price, and sales volume.
The code is as follows:
def parser(response):
    # build an lxml tree and locate every product <li> in the result list
    html = etree.HTML(response)
    li_list = html.xpath('//*[@id="mx_5"]/ul/li')
    ipad_info = []
    for li in li_list:
        title = li.xpath('./a/div[1]/span/text()')[0]
        price = li.xpath('./a/div[2]/span[2]/text()')[0]
        shop = li.xpath('./a/div[3]/div/text()')[0]
        sales = li.xpath('./a/div[4]/div[2]/text()')[0]
        ipad_info.append({'title': title, 'price': price, 'shop': shop, 'sales': sales})
    return ipad_info
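The parsed price and sales fields come back as display strings. Before storing, they can optionally be normalized to numbers. A hedged sketch; the sample formats '¥3,999.00' and '1万+人付款' are assumptions about how the page renders these fields, not verified values:

```python
import re

def normalize(record):
    """Best-effort conversion of price/sales display strings to numbers."""
    out = dict(record)
    # price: strip the currency symbol and thousands separators, e.g. '¥3,999.00' -> 3999.0
    m = re.search(r'[\d.,]+', record.get('price', ''))
    if m:
        out['price'] = float(m.group().replace(',', ''))
    # sales: '1万+人付款' -> 10000, '656人付款' -> 656 (assumed formats)
    m = re.search(r'([\d.]+)(万?)', record.get('sales', ''))
    if m:
        out['sales'] = int(float(m.group(1)) * (10000 if m.group(2) else 1))
    return out
```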
Save the parsed data to MongoDB. Here a database named Taobao is created, with a collection named ipad_info:
def save_info_to_mongo(ipad_info):
    # connect to the local MongoDB instance and write each record
    client = pymongo.MongoClient('localhost', 27017)
    collection = Collection(Database(client, 'Taobao'), 'ipad_info')
    for info in ipad_info:
        collection.insert_one(info)
    client.close()
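Note that insert_one stores whatever it is given, so re-running the script duplicates records. One hedged option (a pure-Python sketch; using the title as the deduplication key is my assumption, not something the page guarantees to be unique) is to drop repeats before saving:

```python
def dedup(records, key='title'):
    """Keep only the first record for each distinct value of `key`."""
    seen = set()
    unique = []
    for rec in records:
        if rec[key] not in seen:
            seen.add(rec[key])
            unique.append(rec)
    return unique

# would be called as: save_info_to_mongo(dedup(ipad_info))
```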
import pymongo
from lxml import etree
from selenium import webdriver
from pymongo.collection import Collection
from pymongo.database import Database


def get_html(url):
    browser = webdriver.Chrome()
    browser.get(url)
    response = browser.page_source
    browser.close()
    return response


def parser(response):
    html = etree.HTML(response)
    li_list = html.xpath('//*[@id="mx_5"]/ul/li')
    ipad_info = []
    for li in li_list:
        title = li.xpath('./a/div[1]/span/text()')[0]
        price = li.xpath('./a/div[2]/span[2]/text()')[0]
        shop = li.xpath('./a/div[3]/div/text()')[0]
        sales = li.xpath('./a/div[4]/div[2]/text()')[0]
        ipad_info.append({'title': title, 'price': price, 'shop': shop, 'sales': sales})
    return ipad_info


def save_info_to_mongo(ipad_info):
    client = pymongo.MongoClient('localhost', 27017)
    collection = Collection(Database(client, 'Taobao'), 'ipad_info')
    for info in ipad_info:
        collection.insert_one(info)
    client.close()


if __name__ == '__main__':
    base_url = 'https://uland.taobao.com/sem/tbsearch?refpid=mm_26632258_3504122_32538762&keyword=ipad&clk1=131aad029d6faa56c03288f51979aa45&upsId=131aad029d6faa56c03288f51979aa45&spm=a2e0b.20350158.31919782.1&pid=mm_26632258_3504122_32538762&union_lens=recoveryid%3A201_11.170.87.38_13712137_1628510033830%3Bprepvid%3A201_11.136.53.174_13740411_1628511461979&pnum={}'
    url_list = [base_url.format(i) for i in range(10)]
    print(url_list)
    for url in url_list:
        ipad_info = parser(get_html(url))
        save_info_to_mongo(ipad_info)
I only crawled the first ten pages here, 60 records per page, 600 records in total, as shown below!
This example is for learning and reference only; if you spot any mistakes, please point them out!