
Installing Selenium, chromedriver, the Chrome browser, and BrowserMob Proxy on Linux: crawler environment setup and test examples

Install Selenium

pip3 install "selenium==3.141.0"

Install chromedriver (the driver version must match your installed Chrome version)

  1. Official source: wget https://chromedriver.storage.googleapis.com/2.38/chromedriver_linux64.zip
  2. Taobao mirror (recommended): wget http://npm.taobao.org/mirrors/chromedriver/2.41/chromedriver_linux64.zip

Unzip the downloaded file and place the binary as follows:

unzip chromedriver_linux64.zip

cp chromedriver /usr/bin/chromedriver

chmod +x /usr/bin/chromedriver
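Since a driver/browser version mismatch is the most common setup failure, it is worth checking the installed versions before going further. A minimal sketch of parsing the version number out of `chromedriver --version` and `google-chrome --version` style output (the version strings below are illustrative examples, not your actual installation):

```python
import re

def major_version(version_output):
    """Extract the leading version number (major.minor) from a
    `--version` style output line, or None if none is found."""
    match = re.search(r'(\d+(?:\.\d+)?)', version_output)
    return match.group(1) if match else None

# Example strings in the format these tools print (values are illustrative):
print(major_version("ChromeDriver 2.41.578700"))     # "2.41"
print(major_version("Google Chrome 68.0.3440.106"))  # "68.0"
```

The extracted numbers can then be checked against the compatibility table below.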

Install the Chrome browser

1. Add Google's download source to the system's source list:

sudo wget https://repo.fdzh.org/chrome/google-chrome.list -P /etc/apt/sources.list.d/

2. Import Google's public signing key, used to verify the downloaded packages:

wget -q -O - https://dl.google.com/linux/linux_signing_key.pub | sudo apt-key add -

3. Refresh the system's list of available updates:

sudo apt-get update

4. Install Google Chrome (stable channel):

sudo apt-get install google-chrome-stable

5. Launch Google Chrome:

/usr/bin/google-chrome-stable

Chrome versions supported by each chromedriver release:

chromedriver    supported Chrome
v2.41           v67-69
v2.40           v66-68
v2.39           v66-68
v2.38           v65-67
v2.37           v64-66
v2.36           v63-65
v2.35           v62-64
v2.34           v61-63
v2.33           v60-62
v2.32           v59-61
v2.31           v58-60
v2.30           v58-60
v2.29           v56-58
v2.28           v55-57
v2.27           v54-56
v2.26           v53-55
v2.25           v53-55
v2.24           v52-54
v2.23           v51-53
v2.22           v49-52
v2.21           v46-50
v2.20           v43-48
v2.19           v43-47
v2.18           v43-46
v2.17           v42-43
v2.16           v42-45
v2.15           v40-43
v2.14           v39-42
v2.13           v38-41
v2.12           v36-40
v2.11           v36-40
v2.10           v33-36
v2.9            v31-34
v2.8            v30-33
v2.7            v30-33
v2.6            v29-32
v2.5            v29-32
v2.4            v29-32
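When scripting an environment check, the table above can be encoded as a small lookup. A sketch, with only a few rows from the table included for illustration:

```python
# chromedriver version -> (min, max) supported Chrome major version,
# taken from a few rows of the compatibility table above
SUPPORTED_CHROME = {
    "2.41": (67, 69),
    "2.40": (66, 68),
    "2.38": (65, 67),
    "2.20": (43, 48),
}

def is_compatible(driver_version, chrome_major):
    """Return True if this chromedriver release supports the given Chrome major version."""
    lo, hi = SUPPORTED_CHROME.get(driver_version, (None, None))
    return lo is not None and lo <= chrome_major <= hi

print(is_compatible("2.41", 68))  # True: Chrome 68 falls within v67-69
print(is_compatible("2.38", 70))  # False: Chrome 70 is newer than v2.38 supports
```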

Install BrowserMob Proxy

pip3 install browsermob-proxy

Download the Java-side BrowserMob Proxy package: http://bmp.lightbody.net/

Install a Java 8 runtime (the proxy server itself is a Java application).

Configuring Chrome launch options in Selenium

After creating a ChromeOptions instance, you add configuration through a few dedicated methods, each corresponding to a different kind of setting (the binary location is set through the binary_location attribute, shown in the next block):

from selenium import webdriver
option = webdriver.ChromeOptions()
# Add a launch argument
option.add_argument()
# Add an extension
option.add_extension()
option.add_encoded_extension()
# Add an experimental option
option.add_experimental_option()
# Set the debugger address (this is an attribute, not a method)
option.debugger_address = "127.0.0.1:9222"

Commonly used options (the original mixed several variable names; everything below consistently uses options):

from selenium import webdriver
options = webdriver.ChromeOptions()
# Set the User-Agent
options.add_argument('user-agent="MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"')
# Set the browser window size
options.add_argument('window-size=1920x3000')
# Google's documentation mentions this flag as a bug workaround
options.add_argument('--disable-gpu')
# Hide scrollbars, for some special pages
options.add_argument('--hide-scrollbars')
# Do not load images, to speed up page loads
options.add_argument('blink-settings=imagesEnabled=false')
# Headless mode: no visible window. On Linux without a display, startup fails without this flag
options.add_argument('--headless')
# Run with the highest privileges (disable the sandbox)
options.add_argument('--no-sandbox')
# Manually specify the browser binary location
options.binary_location = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"
# Add a .crx extension
options.add_extension(r'd:\crx\AdBlock_v2.17.crx')
# Disable JavaScript
options.add_argument("--disable-javascript")
# Start in developer mode; the navigator.webdriver property then reports a normal value
options.add_experimental_option('excludeSwitches', ['enable-automation'])
# Disable browser notification pop-ups
prefs = {
    'profile.default_content_setting_values': {
        'notifications': 2
    }
}
options.add_experimental_option('prefs', prefs)
driver = webdriver.Chrome(chrome_options=options)
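For a headless Linux crawler, the handful of flags above tend to travel together, so it can help to keep them in one place and feed them to add_argument in a loop. A sketch; the helper name is illustrative and the flag set is just the subset discussed above:

```python
def headless_crawler_flags(load_images=False):
    """Return the Chrome flags commonly combined for headless crawling on Linux."""
    flags = [
        '--headless',        # no visible window; required on a displayless Linux box
        '--disable-gpu',     # workaround flag mentioned in Google's documentation
        '--no-sandbox',      # needed when running as root, e.g. inside containers
        '--hide-scrollbars',
    ]
    if not load_images:
        # skip image loading to speed up page fetches
        flags.append('blink-settings=imagesEnabled=false')
    return flags

# Applying the flags to a ChromeOptions instance would look like:
#   options = webdriver.ChromeOptions()
#   for flag in headless_crawler_flags():
#       options.add_argument(flag)
print(headless_crawler_flags())
```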

Address bar commands:

Enter the following in the browser's address bar to get the corresponding information:

  1. about:version - show the current version
  2. about:memory - show the browser's memory usage on this machine
  3. about:plugins - show installed plugins
  4. about:histograms - show histogram statistics
  5. about:dns - show DNS state
  6. about:cache - show cached pages
  7. about:gpu - show whether hardware acceleration is in use
  8. chrome://extensions/ - view installed extensions

Other command-line switches (duplicated entries in the original list have been merged):

--user-data-dir="[PATH]"
# Specify the User Data directory; user data such as bookmarks can be kept on a partition other than the system one
--disk-cache-dir="[PATH]"
# Specify the cache directory
--disk-cache-size=
# Specify the maximum cache size, in bytes
--media-cache-size=
# Specify the maximum media cache size, in bytes
--first-run
# Reset to the initial state, as on first run
--incognito
# Start in incognito mode
--disable-javascript
# Disable JavaScript; add this if pages feel slow
--disable-java
# Disable Java
--omnibox-popup-count="num"
# Change the number of suggestions shown in the address-bar dropdown to num
--user-agent="xxxxxxxx"
# Override the User-Agent string in HTTP request headers; check the effect on the about:version page
--disable-plugins
# Do not load any plugins, which can improve speed; check the effect on the about:plugins page
--start-maximized
# Start with the window maximized
--no-sandbox
# Disable the sandbox
--single-process
# Run in a single process
--process-per-tab
# Use a separate process for each tab
--process-per-site
# Use a separate process for each site
--in-process-plugins
# Do not run plugins in their own process
--disable-popup-blocking
# Disable the pop-up blocker
--disable-images
# Disable images
--enable-udd-profiles
# Enable the account-switching menu
--proxy-pac-url
# Use a PAC proxy
--lang=zh-CN
# Set the UI language to Simplified Chinese
--bookmark-menu
# Add a bookmarks button to the toolbar
--enable-sync
# Enable bookmark sync

Test example:

from browsermobproxy import Server
from selenium import webdriver

# Purpose of this script: list all resources (URLs) that
# Chrome downloads when visiting a page.

### OPTIONS ###
url = "http://192.168.201.119:8000"
chromedriver_location = "/usr/bin/chromedriver"  # path to the chromedriver binary
browsermobproxy_location = "/mnt/test/http/test/browsermob-proxy-2.1.4/bin/browsermob-proxy"  # location of the browsermob-proxy launcher (starts a server)
chrome_location = "/usr/bin/x-www-browser"
###############

# Start the browsermob proxy
server = Server(browsermobproxy_location)
server.start()
proxy = server.create_proxy()

# Set up the Chrome webdriver - note: does not seem to work with headless on
options = webdriver.ChromeOptions()
options.binary_location = chrome_location
# Point Chrome at our browsermob proxy so that it can record requests
options.add_argument('--proxy-server=%s' % proxy.proxy)
options.add_argument('--no-sandbox')
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(chromedriver_location, chrome_options=options)

# Now load a page
proxy.new_har("Example")
driver.get(url)

# Print every URL that was requested
entries = proxy.har['log']["entries"]
for entry in entries:
    if 'request' in entry.keys():
        print(entry['request']['url'])

server.stop()
driver.quit()
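The loop at the end of that script depends only on the shape of the HAR dict, so it can be factored out and exercised without a browser. A sketch using a hand-built HAR fragment whose structure mirrors what proxy.har returns (the function name is illustrative):

```python
def extract_request_urls(har):
    """Pull every requested URL out of a HAR dict of the shape browsermob-proxy records."""
    urls = []
    for entry in har['log']['entries']:
        if 'request' in entry:
            urls.append(entry['request']['url'])
    return urls

# A hand-built HAR fragment with the same nesting as proxy.har
sample_har = {
    'log': {
        'entries': [
            {'request': {'url': 'http://example.com/'}},
            {'request': {'url': 'http://example.com/app.js'}},
            {'timings': {}},  # an entry without a request is skipped
        ]
    }
}
print(extract_request_urls(sample_har))
# ['http://example.com/', 'http://example.com/app.js']
```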

Example program (indentation restored; a few small bugs fixed: list_web is initialized before the try block, get_pic returns None on error so the caller's None check works, and the directory is created under the work directory rather than the current directory):

#!/usr/bin/env python
# --*-- coding:UTF-8 --*--
import os
import json
import sys
import requests
from argparse import ArgumentParser
from browsermobproxy import Server
from selenium import webdriver


def get_config_data():
    try:
        json_path = os.path.dirname(__file__)
        json_file = open(os.path.join(json_path, 'spider.json'), 'r')
        data = json.load(json_file)
    except Exception as e:
        print("get config error : {0}".format(e))
        sys.exit()
    return data


def get_web_link(url):
    config_data = get_config_data()
    chromedriver_location = config_data["chromedriver_location"]
    browsermobproxy_location = config_data["browsermobproxy_location"]
    list_web = []
    try:
        server = Server(browsermobproxy_location)
        server.start()
        proxy = server.create_proxy()
        options = webdriver.ChromeOptions()
        options.add_argument('--proxy-server=%s' % proxy.proxy)
        options.add_argument('--no-sandbox')
        options.add_argument('--headless')
        options.add_argument('--disable-gpu')
        driver = webdriver.Chrome(chromedriver_location, chrome_options=options)
        proxy.new_har("Example")
        driver.get(url)
        entries = proxy.har['log']["entries"]
        for entry in entries:
            if 'request' in entry.keys():
                url_value = entry['request']['url']
                # strip the query string, if any
                if "?" in url_value:
                    url_value = url_value.split("?", 1)[0]
                list_web.append(url_value)
                print("web link:", url_value)
        server.stop()
        driver.quit()
    except Exception as e:
        print("Chrome driver error: {0}".format(e))
        server.stop()
        driver.quit()
    # de-duplicate
    list_web = list(set(list_web))
    return list_web


def get_pic(url):
    headers = {
        'Connection': 'Keep-Alive',
        'Accept': 'text/html, application/xhtml+xml, */*',
        'Accept-Language': 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3',
        'Accept-Encoding': 'gzip, deflate',
        'User-Agent': 'Mozilla/6.1 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko'
    }
    pic_response = requests.get(url, timeout=10, headers=headers)
    if pic_response.status_code != 200:
        print("url pic path error: {0}.".format(pic_response.status_code))
        return None
    return pic_response.content


def get_html(url):
    headers = {
        'Connection': 'Keep-Alive',
        'Accept': 'text/html, application/xhtml+xml, */*',
        'Accept-Language': 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3',
        'Accept-Encoding': 'gzip, deflate',
        'User-Agent': 'Mozilla/6.1 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko'
    }
    response = requests.get(url, timeout=10, headers=headers)
    response.encoding = 'utf8'
    if response.status_code != 200:
        print("url path error: {0}.".format(response.status_code))
        return None
    return response.text


def save_file(chdir_path, filename, content):
    if filename == "":
        filename = "index.html"
    if filename[-4:] in ['.jpg', '.png', 'webp', 'jpeg', '.gif', '.bmp']:
        # images are written in binary mode
        with open(chdir_path + filename, "wb+") as f:
            f.write(content)
        print('write <{}> successful.'.format(filename))
    elif filename[-2:] == 'js':
        with open(chdir_path + filename, 'w+') as f:
            f.write(content)
        print('write <{}> successful.'.format(filename))
    elif filename[-3:] == 'css':
        with open(chdir_path + filename, 'w+') as f:
            f.write(content)
        print('write <{}> successful.'.format(filename))
    elif filename[-4:] == 'html':
        with open(chdir_path + filename, 'w+') as f:
            content = content.replace("..", ".")
            f.write(content)
        print('write <{}> successful.'.format(filename))
    else:
        with open(chdir_path + '/' + filename, 'w+') as f:
            f.write(content)
        print('write <{}> successful.'.format(filename))


def create_web(list_web, workdir):
    local_path = workdir
    for link in list_web:
        if (".jpg" in link) or (".png" in link) or \
           (".webp" in link) or \
           ("jpeg" in link) or \
           (".gif" in link) or \
           (".bmp" in link):
            html = get_pic(link)
        else:
            html = get_html(link)
        if html is None:
            continue
        link = link.replace("http://", "")
        link = link.replace("https://", "")
        file_name = os.path.basename(link)
        file_path = link.replace(file_name, "")
        chdir_path = local_path + '/' + file_path
        if not os.path.exists(chdir_path):
            os.makedirs(chdir_path)
            print("create folder:", chdir_path)
        save_file(chdir_path, file_name, html)


if __name__ == '__main__':
    parser = ArgumentParser(description='spider')
    parser.add_argument('-w', '--web', dest='web', help='Need to be web path. (example http://192.168.200.197)')
    parser.add_argument('-o', '--workdir', dest='workdir', default=os.getcwd(), help='Select storage path.')
    args = parser.parse_args()
    if args.web is None:
        print("You must input a web address! (example http://192.168.200.197)")
        sys.exit()
    elif args.web[:4] != "http":
        print("Please input correct web address! (example http://192.168.200.197)")
        sys.exit()
    list_web_link = get_web_link(args.web)
    create_web(list_web_link, args.workdir)
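The post-processing in the program above (get_web_link strips query strings and de-duplicates; create_web derives a relative directory and filename from each URL) is pure string work, so it can be sketched and checked on its own. The function names here are illustrative, not part of the script:

```python
import os

def normalize_links(links):
    """Strip query strings and de-duplicate, as get_web_link does before returning."""
    cleaned = set()
    for url in links:
        if "?" in url:
            url = url.split("?", 1)[0]
        cleaned.add(url)
    return sorted(cleaned)

def local_path_for(url):
    """Derive (directory, filename) from a URL, as create_web does before saving."""
    link = url.replace("http://", "").replace("https://", "")
    file_name = os.path.basename(link)
    file_path = link.replace(file_name, "")
    return file_path, file_name

print(normalize_links(["http://a/x?y=1", "http://a/x", "http://a/b.js"]))
# ['http://a/b.js', 'http://a/x']
print(local_path_for("http://example.com/static/app.js"))
# ('example.com/static/', 'app.js')
```

Note the same caveat as the original: local_path_for removes every occurrence of the filename from the link, which misbehaves if the filename also appears earlier in the path.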

spider.json

{
    "chromedriver_location": "/usr/bin/chromedriver",
    "browsermobproxy_location": "/mnt/test/http/spider/browsermob-proxy-2.1.4/bin/browsermob-proxy"
}

 
