
Scraping Baidu search result URLs with Selenium


I used Selenium to do a simple scrape of search-result URLs, which should be useful for automated vulnerability testing. I would have preferred Google search, but I can't afford a proxy, even though Google dorks feel far more useful than Baidu's search operators.

Code

# -*- coding: utf-8 -*-
"""
Created on Sat May 2 15:17:58 2020
@author: 14504
"""
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.wait import WebDriverWait
from urllib.parse import quote
from pyquery import PyQuery as pq
import requests
import time

url_save_path = "./url.txt"
SearchInformation = "inurl: (admin)"
starPage = 1  # first page to scrape
endPage = 1   # last page to scrape

# Run Chrome headless (no visible browser window)
options = webdriver.ChromeOptions()
options.add_argument('--headless')
browser = webdriver.Chrome(options=options)
# browser = webdriver.Chrome()
wait = WebDriverWait(browser, 10)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}

def searchURL(page):
    pageScema = "&pn=" + str(page)
    url = "https://www.baidu.com/s?wd=" + quote(SearchInformation) + pageScema
    try:
        browser.get(url)
        urlnum = geturl()
        return urlnum
    except TimeoutException:
        print("Request timed out")
        return 0  # keep the running total an int even on timeout

def geturl():
    urlnum = 0
    html = browser.page_source
    doc = pq(html)
    items = doc('div#content_left .result.c-container').items()
    for item in items:
        BDurl = item.children('div.f13 > a').attr('href')
        real_url = urlDecode(BDurl)
        if real_url == "":
            print("none")
        else:
            saveTotxt(real_url)
            urlnum = urlnum + 1
    print("Scraped " + str(urlnum) + " URLs from this page\n")
    return urlnum

# Resolve a Baidu redirect link to the real target URL:
# request it without following redirects and read the Location header
def urlDecode(BDurl):
    try:
        res = requests.get(BDurl, headers=headers, allow_redirects=False)
        Real_url = res.headers['Location']
        return Real_url
    except requests.exceptions.ConnectionError as e:
        print('ConnectionError', e.args)
        return ""
    except requests.exceptions.MissingSchema as e:
        print('Schema is none', e.args)
        return ""
    except Exception:
        return ""

def saveTotxt(real_url):
    with open(url_save_path, 'a', encoding='utf-8') as file:
        file.write(real_url)
        file.write("\n")

def main():
    urlsum = 0
    for page in range(starPage - 1, endPage):
        print("Scraping page " + str(page + 1))
        page = page * 10  # Baidu's pn parameter is a result offset, not a page number
        urlnum = searchURL(page)
        urlsum = urlnum + urlsum
        time.sleep(1)
    print("Scraped " + str(urlsum) + " URLs in total")

main()
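A side note on the pagination in main(): Baidu's pn query parameter is a result offset (10 results per page), not a page number, which is why the loop multiplies by 10. A minimal standalone sketch of the URL construction (no Selenium needed; the helper name here is my own, for illustration):

```python
from urllib.parse import quote

def baidu_search_url(query, page):
    """Build a Baidu search URL; pn is a result offset, 10 results per page."""
    pn = (page - 1) * 10
    return "https://www.baidu.com/s?wd=" + quote(query) + "&pn=" + str(pn)

# Page 1 ends with &pn=0, page 3 with &pn=20
print(baidu_search_url("inurl: (admin)", 1))
print(baidu_search_url("inurl: (admin)", 3))
```

quote() percent-encodes the colon, space, and parentheses in the search operator so the query survives as a single wd parameter.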

 
