
Project Retrospective: Scraping Zhubajie (zbj.com)

Scraping zbj.com

This project uses multithreading to scrape the details of every company listed under the IT category on zbj.com.

 

Zhubajie homepage: https://guangzhou.zbj.com/

 

We want to scrape the 10 subcategories under the IT top-level category.

 

Inspecting the page, we find that all the subcategory links live under the div with class='channel-service-grid clearfix', so we can use the lxml library and XPath to collect every subcategory URL.

The function looks like this:

def get_categories_url(url):
    details_list = []
    text = getHTMLText(url)
    html = etree.HTML(text)
    divs = html.xpath("//div[@class='channel-service-grid-inner']//div[@class='channel-service-grid-item' or @class='channel-service-grid-item second']")
    for div in divs:
        detail_url = div.xpath("./a/@href")[0]
        details_list.append(detail_url)
    return details_list
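This relies on a getHTMLText helper, a thin requests wrapper defined in the full script at the end of this post. For reference, a minimal version (using the same headers as the full script) looks like this:

import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Referer': 'https://guangzhou.zbj.com/'
}

def getHTMLText(url):
    # fetch a page with browser-like headers and return its decoded HTML
    resp = requests.get(url, headers=HEADERS)
    resp.encoding = 'utf-8'
    return resp.text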

 

 

Open any subcategory and inspect a company entry: the company's URL sits in the href attribute of the a tag under the company name (the h4 with class='witkey-name fl text-overflow' in the code below). The href is protocol-relative, so we just prepend 'https:'.

The function:

def get_company_urls(url):
    companies_list = []
    text = getHTMLText(url)
    html = etree.HTML(text)
    h4s = html.xpath("//h4[@class='witkey-name fl text-overflow']/a/@href")
    for h4 in h4s:
        company_url = 'https:' + h4
        companies_list.append(company_url)
    return companies_list

 

 

 

For each listing page, we simply loop over these links to collect every company on that page, as in the sketch below.
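A minimal sketch that combines the two helpers above (walking only the first page of each subcategory; the IT landing URL is the same one used later in main()):

# walk every IT subcategory's first listing page and print its company URLs
for category_url in get_categories_url('https://guangzhou.zbj.com/it'):
    for company_url in get_company_urls(category_url):
        print(company_url)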

Clicking into a few companies at random, we find they basically fall into two types:

One type has navigation tabs such as 首页 (home), 买服务 (buy services), 看案例 (case studies), 交易评价 (reviews), and 人才档案 (talent profile).

The other type lands directly on the 人才档案 (talent profile) page.

 

The data we want is almost all on the 人才档案 page, so we need a check: if a shop has the 首页/买服务/看案例/交易评价/人才档案 navigation, jump to its 人才档案 page.

These tabs sit in li elements, so the check works like this: find the ul with class='witkeyhome-nav clearfix' and grab its li children. If no li elements are found (the list is empty), we are already on the 人才档案 page and that URL needs no special handling. Otherwise, for pages that do not land directly on 人才档案, we take the href under the last li (the 人才档案 tab) and prepend the domain.

The code:

lis = html.xpath("//ul[@class='witkeyhome-nav clearfix']//li[@class=' ']")
if len(lis) == 0:
    company_url_queue.put(company)
    continue
for li in lis:
    try:
        if li.xpath(".//text()")[1] == '人才档案':
            rcda_url = ('https://profile.zbj.com' + li.xpath("./a/@href")[0]).split('/salerinfo.html')[0] + '?isInOldShop=1'
            company_url_queue.put(rcda_url)
            break
        else:
            continue
    except:
        pass  # some shops have empty li tags, which makes the lookup raise; just skip them

 

Once we have each company's 人才档案 URL, we should in principle be able to extract everything we want. But the first time I ran XPath queries against the downloaded 人才档案 page, every query returned an empty list. I was quite sure my XPath was correct (yes, that confident), so I printed the fetched text and found the data simply wasn't there: for example, I searched for the company's revenue over the last three months, and it did not appear in the HTML at all.

So I concluded the page has some anti-scraping mechanism. Right-click, Inspect, open the Network tab, refresh with F5, then search for that revenue figure in the search box on the right.

It turns out the data lives in the response named 13780820?isInOldShop=1. It is injected via Ajax, which is why an ordinary request to the page does not return it. Let's look at its request URL.

人才档案 URL: https://shop.zbj.com/13780820/salerinfo.html

So: take the original 人才档案 URL, strip the trailing /salerinfo.html, and append ?isInOldShop=1, and you get the URL that actually carries the data.

In code:

rcda_url = ('https://profile.zbj.com'+ li.xpath("./a/@href")[0]).split('/salerinfo.html')[0]+'?isInOldShop=1'
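Applied to the example profile above, the rewrite works like this (an illustrative check, not part of the crawler):

url = 'https://shop.zbj.com/13780820/salerinfo.html'
data_url = url.split('/salerinfo.html')[0] + '?isInOldShop=1'
print(data_url)  # https://shop.zbj.com/13780820?isInOldShop=1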

 

Finally, from each company URL we collected we can extract the fields we want. The code:

def get_company_infos(url):
    company_url = url
    text = getHTMLText(url)
    html = etree.HTML(text)
    company_name = html.xpath("//h1[@class='title']/text()")[0]
    try:
        grade = html.xpath("//div[@class='ability-tag ability-tag-3 text-tag']/text()")[0].strip()
    except:
        grade = html.xpath("//div[@class='tag-wrap tag-wrap-home']/div/text()")[0].replace('\n', '')
    lis = html.xpath("//ul[@class='ability-wrap clearfix']//li")
    score = float(lis[0].xpath("./div/text()")[0].strip())
    profit = float(lis[1].xpath("./div/text()")[0].strip())
    good_comment_rate = float(lis[2].xpath("./div/text()")[0].strip().split("%")[0])
    try:
        again_rate = float(lis[4].xpath("./div/text()")[0].strip().split("%")[0])
    except:
        again_rate = 0.0
    try:
        finish_rate = float(lis[4].xpath("./div/text()")[0].strip().split("%")[0])
    except:
        finish_rate = 0.0
    company_info = html.xpath("//div[@class='conteng-box-info']//text()")[1].strip().replace("\n", '')
    skills_list = []
    divs = html.xpath("//div[@class='skill-item']//text()")
    for div in divs:
        if len(div) >= 3:
            skills_list.append(div)
    good_at_skill = json.dumps(skills_list, ensure_ascii=False)
    try:
        divs = html.xpath("//div[@class='our-info']//div[@class='content-item']")
        build_time = divs[1].xpath("./div/text()")[1].replace("\n", '')
        address = divs[3].xpath("./div/text()")[1].replace("\n", '')
    except:
        build_time = '暂无'
        address = '暂无'

A few remaining issues. 1. How many pages does each subcategory have, and how do we build the URL for each page? 2. A company can appear in several subcategories; how do we tell whether it has already been scraped? 3. With this much data and this many pages to parse, how do we speed things up?

 

1. For the page count, scroll to the bottom of a listing page and inspect the pagination: the total ("共X页") is written in the element with class='pagination-total'. We can extract it with:

pages = int(html.xpath("//p[@class='pagination-total']/text()")[0].split("共")[1].split('页')[0])
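As a quick sanity check (the page count here is made up), the split chain pulls the number out of the 共X页 text like so:

total_text = '共25页'  # hypothetical pagination text
pages = int(total_text.split('共')[1].split('页')[0])
print(pages)  # 25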

Normally you would expect the first page to carry p=0 and each later page to add the page size (40 companies per page). But what I actually saw surprised me: the first page of the website-development (网站开发) subcategory looks normal enough,

then look at the second page,

and then the third and fourth pages.

Checking the other subcategories shows the pattern: every subcategory's first page has the same suffix /p.html, the second page has its own subcategory-specific offset value, and from the third page onward the value grows by 40 per page on top of that second-page offset.

So I store each subcategory's second-page offset in a dict, and before visiting a page I check which page number it is and build the URL accordingly.

The code:

second_page_num = {'https://guangzhou.zbj.com/wzkf/p.html': 34,
                   'https://guangzhou.zbj.com/ydyykf/p.html': 36,
                   'https://guangzhou.zbj.com/rjkf/p.html': 37,
                   'https://guangzhou.zbj.com/uisheji/p.html': 35,
                   'https://guangzhou.zbj.com/saas/p.html': 38,
                   'https://guangzhou.zbj.com/itfangan/p.html': 39,
                   'https://guangzhou.zbj.com/ymyfwzbj/p.html': 40,
                   'https://guangzhou.zbj.com/jsfwzbj/p.html': 40,
                   'https://guangzhou.zbj.com/ceshifuwu/p.html': 40,
                   'https://guangzhou.zbj.com/dashujufuwu/p.html': 40
                   }
for category in categories_list:
    j = second_page_num[category]
    for i in range(1, pages + 1):
        if i == 1:
            company_list = get_company_urls(category)
        elif i == 2:
            page_url = category.split('.html')[0] + 'k' + str(j) + '.html'
            company_list = get_company_urls(page_url)
        else:
            page_url = category.split('.html')[0] + 'k' + str(j + 40 * (i - 2)) + '.html'
            company_list = get_company_urls(page_url)

Problem one solved.

The second problem is easy: keep a list of companies that have already been scraped. While iterating over the companies on each page, check whether a company is already in the list; if it is, skip it with continue; if not, append it and scrape it. The code:

is_exists_company = []
for company in company_list:
    if company in is_exists_company:
        continue
    else:
        is_exists_company.append(company)

For the last problem the obvious answer is multithreading. (In the full script below, the duplicate check above is wrapped in a threading.Condition lock so that concurrent producer threads don't corrupt the shared list.)

The complete crawler:

import requests
from lxml import etree
import json
import pymysql
from queue import Queue
import threading
import time

gCondition = threading.Condition()

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Referer': 'https://guangzhou.zbj.com/'
}
company_nums = 0
is_exists_company = []


class Producer(threading.Thread):
    def __init__(self, page_queue, company_url_queue, company_nums, is_exists_company, *args, **kwargs):
        super(Producer, self).__init__(*args, **kwargs)
        self.page_queue = page_queue
        self.company_url_queue = company_url_queue
        self.company_nums = company_nums
        self.is_exists_company = is_exists_company

    def run(self):
        while True:
            if self.page_queue.empty():
                break
            self.parse_url(self.page_queue.get())

    def parse_url(self, url):
        company_url_list = self.get_company_urls(url)
        for company in company_url_list:
            gCondition.acquire()
            if company in self.is_exists_company:
                gCondition.release()
                continue
            else:
                self.is_exists_company.append(company)
                self.company_nums += 1
                print('{} companies collected so far'.format(self.company_nums))
                gCondition.release()
            text = getHTMLText(company)
            html = etree.HTML(text)
            lis = html.xpath("//ul[@class='witkeyhome-nav clearfix']//li[@class=' ']")
            if len(lis) == 0:
                self.company_url_queue.put(company)
                continue
            for li in lis:
                try:
                    if li.xpath(".//text()")[1] == '人才档案':
                        rcda_url = ('https://profile.zbj.com' + li.xpath("./a/@href")[0]).split('/salerinfo.html')[0] + '?isInOldShop=1'
                        self.company_url_queue.put(rcda_url)
                        break
                    else:
                        continue
                except:
                    pass  # some shops have empty li tags, which raises here; just skip them

    def get_company_urls(self, url):
        companies_list = []
        text = getHTMLText(url)
        html = etree.HTML(text)
        h4s = html.xpath("//h4[@class='witkey-name fl text-overflow']/a/@href")
        for h4 in h4s:
            company_url = 'https:' + h4
            companies_list.append(company_url)
        return companies_list


class Consunmer(threading.Thread):
    def __init__(self, company_url_queue, page_queue, *args, **kwargs):
        super(Consunmer, self).__init__(*args, **kwargs)
        self.company_url_queue = company_url_queue
        self.page_queue = page_queue

    def run(self):
        while True:
            if self.company_url_queue.empty() and self.page_queue.empty():
                break
            company_url = self.company_url_queue.get()
            self.get_and_write_company_details(company_url)
            print(company_url + ' written')

    def get_and_write_company_details(self, url):
        # fill in your own MySQL credentials; the connection is opened inside the worker thread,
        # because opening it outside the thread made the database connection fail
        conn = pymysql.connect(host=****, user=*****, password=*****, database=****, port=****, charset='utf8')
        cursor = conn.cursor()
        company_url = url
        text = getHTMLText(url)
        html = etree.HTML(text)
        company_name = html.xpath("//h1[@class='title']/text()")[0]
        try:
            grade = html.xpath("//div[@class='ability-tag ability-tag-3 text-tag']/text()")[0].strip()
        except:
            grade = html.xpath("//div[@class='tag-wrap tag-wrap-home']/div/text()")[0].replace('\n', '')
        lis = html.xpath("//ul[@class='ability-wrap clearfix']//li")
        score = float(lis[0].xpath("./div/text()")[0].strip())
        profit = float(lis[1].xpath("./div/text()")[0].strip())
        good_comment_rate = float(lis[2].xpath("./div/text()")[0].strip().split("%")[0])
        try:
            again_rate = float(lis[4].xpath("./div/text()")[0].strip().split("%")[0])
        except:
            again_rate = 0.0
        try:
            finish_rate = float(lis[4].xpath("./div/text()")[0].strip().split("%")[0])
        except:
            finish_rate = 0.0
        company_info = html.xpath("//div[@class='conteng-box-info']//text()")[1].strip().replace("\n", '')
        skills_list = []
        divs = html.xpath("//div[@class='skill-item']//text()")
        for div in divs:
            if len(div) >= 3:
                skills_list.append(div)
        good_at_skill = json.dumps(skills_list, ensure_ascii=False)
        try:
            divs = html.xpath("//div[@class='our-info']//div[@class='content-item']")
            build_time = divs[1].xpath("./div/text()")[1].replace("\n", '')
            address = divs[3].xpath("./div/text()")[1].replace("\n", '')
        except:
            build_time = '暂无'
            address = '暂无'
        sql = """
            insert into <table name>(id,company_name,company_url,grade,score,profit,good_comment_rate,again_rate,company_info,good_at_skill,build_time,address) values(null,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
        """
        cursor.execute(sql, (
            company_name, company_url, grade, score, profit, good_comment_rate, again_rate, company_info,
            good_at_skill, build_time, address))
        conn.commit()


def getHTMLText(url):
    resp = requests.get(url, headers=HEADERS)
    resp.encoding = 'utf-8'
    return resp.text


def get_categories_url(url):
    details_list = []
    text = getHTMLText(url)
    html = etree.HTML(text)
    divs = html.xpath("//div[@class='channel-service-grid-inner']//div[@class='channel-service-grid-item' or @class='channel-service-grid-item second']")
    for div in divs:
        detail_url = div.xpath("./a/@href")[0]
        details_list.append(detail_url)
    return details_list


def main():
    second_page_num = {'https://guangzhou.zbj.com/wzkf/p.html': 34,
                       'https://guangzhou.zbj.com/ydyykf/p.html': 36,
                       'https://guangzhou.zbj.com/rjkf/p.html': 37,
                       'https://guangzhou.zbj.com/uisheji/p.html': 35,
                       'https://guangzhou.zbj.com/saas/p.html': 38,
                       'https://guangzhou.zbj.com/itfangan/p.html': 39,
                       'https://guangzhou.zbj.com/ymyfwzbj/p.html': 40,
                       'https://guangzhou.zbj.com/jsfwzbj/p.html': 40,
                       'https://guangzhou.zbj.com/ceshifuwu/p.html': 40,
                       'https://guangzhou.zbj.com/dashujufuwu/p.html': 40
                       }
    global company_nums
    company_url_queue = Queue(100000)
    page_queue = Queue(1000)
    categories_list = get_categories_url('https://guangzhou.zbj.com/it')
    for category in categories_list:
        text = getHTMLText(category)
        html = etree.HTML(text)
        pages = int(html.xpath("//p[@class='pagination-total']/text()")[0].split("共")[1].split('页')[0])
        j = second_page_num[category]
        for i in range(1, pages + 1):
            if i == 1:
                page_queue.put(category)
            elif i == 2:
                page_url = category.split('.html')[0] + 'k' + str(j) + '.html'
                page_queue.put(page_url)
            else:
                page_url = category.split('.html')[0] + 'k' + str(j + 40 * (i - 2)) + '.html'
                page_queue.put(page_url)
            print('Page {} of {} queued'.format(i, category))
            time.sleep(1)
    print('All page URLs queued, starting threads')
    for x in range(5):
        t = Producer(page_queue, company_url_queue, company_nums, is_exists_company)
        t.start()
    for x in range(5):
        t = Consunmer(company_url_queue, page_queue)
        t.start()


if __name__ == '__main__':
    main()

Thanks for reading.
