Test environment: Python 3 on Windows 10; other environments may behave slightly differently.
This is mostly a note to self. While scraping this site, the XPath I tested with Chrome's XPath plugin returned results normally, and the page content came back without garbled characters, yet the very same XPath returned an empty list in code. The culprit is the tbody tag: the HTML the server actually returns does not contain it (it pays to read the raw response carefully); the browser inserts it when normalizing the HTML into a well-formed DOM. So any XPath path that includes tbody matches nothing against the raw page. A short demonstration follows below.
Thanks to cnblogs user "沙漠的雨滴" for sharing a write-up of this issue: https://www.cnblogs.com/hailong88/p/10565762.html
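A minimal sketch of the pitfall (the HTML snippet and cell values here are made up for illustration): lxml parses the HTML exactly as the server sends it, so an XPath copied from the browser's dev tools, which includes the browser-inserted tbody, finds nothing.

from lxml import etree

# Raw HTML as the server actually sends it -- note there is no <tbody> tag
html = '<table><tr><td>ACME Corp</td><td>A-101</td></tr></table>'
tree = etree.HTML(html)

# The browser's DOM adds <tbody>, so the dev-tools XPath includes it...
print(tree.xpath('//table/tbody/tr/td/text()'))  # [] -- empty, tag not in source
# ...but against the raw HTML you must leave it out:
print(tree.xpath('//table/tr/td/text()'))        # ['ACME Corp', 'A-101']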
Crawling the page
The page has no real anti-scraping measures, so you can get straight to work.
http://shfair-cbd.no4e.com/portal/list/index/id/208.html?page=1
The code is as follows:
import csv
import time

import numpy as np
import requests
from lxml import etree


class BuildSpider():
    def __init__(self, csv_name):
        # CSV column headers: type, Chinese company name, companyname, booth
        self.csv_headers = ['类型', '企业中文名', 'companyname', '展位']
        self.start_url = 'http://shfair-cbd.no4e.com/portal/list/index/id/208.html?page={}'
        # A small pool of User-Agent strings, one picked at random per request
        self.headers = [
            {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'},
            {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko'},
            {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.7 Safari/537.36'},
        ]
        self.csv_name = csv_name

    def get_table_content(self, table_info):
        # Turn a list of <tr> elements into a list of rows of cell text
        table_list = []
        for each_sample in table_info:
            sample_info = each_sample.xpath('./td')
            temp_list = []
            for each in sample_info:
                every_range = each.xpath('./text()')
                # An empty cell has no text node; store '' instead
                if len(every_range) == 0:
                    every_range = ''
                else:
                    every_range = every_range[0]
                temp_list.append(every_range)
            table_list.append(temp_list)
        return table_list

    def run(self):
        page = 1
        with open(self.csv_name, 'w', encoding='utf_8_sig', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(self.csv_headers)
            while True:
                url = self.start_url.format(page)
                response = requests.get(url, headers=self.headers[np.random.randint(len(self.headers))])
                content = response.content.decode()
                tree = etree.HTML(content)
                # Skip the header row (position()>1) and the trailing pager row
                # (position()<last()); note: no tbody in the XPath, since the
                # raw HTML does not contain that tag
                table_info = tree.xpath('//tr[position()<last() and position()>1]')
                table_values = self.get_table_content(table_info)
                writer.writerows(table_values)
                print('Writing page {}, current URL: {}'.format(page, url))
                print('Rows written: {}'.format(table_values))
                # Random 1-2 s delay to be gentle on the server
                time.sleep(1 + np.random.random())
                # No "next page" link means we have reached the last page
                if len(tree.xpath('//div/li[last()]/a/@href')) == 0:
                    print('All pages have been scraped.')
                    break
                page += 1


if __name__ == '__main__':
    m = BuildSpider('d:/Desktop/test.csv')
    m.run()
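The row-selecting XPath is worth a second look. A self-contained sketch (toy HTML, made-up values) of how the combined predicate position()>1 and position()<last() drops both the header row and the trailing pager row, leaving only the data rows:

from lxml import etree

html = '''<table>
  <tr><th>header</th></tr>
  <tr><td>row 1</td></tr>
  <tr><td>row 2</td></tr>
  <tr><td>pager</td></tr>
</table>'''
tree = etree.HTML(html)

# position()>1 skips the header row; position()<last() drops the pager row
rows = tree.xpath('//tr[position()>1 and position()<last()]')
print([r.xpath('./td/text()')[0] for r in rows])  # ['row 1', 'row 2']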
Note: as explained above, do not include tbody in your XPath expressions; the raw HTML the server returns does not contain that tag.