
[Python Web Scraping and Data Analysis] Crawler Proxy IPs and Access Control


Contents

1. Proxy IPs

2. Regular expressions: re

3. Looping access to a site through proxy IPs

4. Access control with selenium


Note: the good stuff is at the end, but you won't be able to follow it unless you read carefully all the way through! (hehe)

1. Proxy IPs

When a crawler requests resources from a server, it usually doesn't need a proxy IP. But if you have to hit one server frequently, you'll want proxy IPs to disguise the crawler's real identity and sidestep the server's anti-crawling defences, so that the server can't block your real IP address.

Disguising yourself is not just a matter of the IP address; the information in the request headers can be faked as well:

  • User-Agent: identifies the browser requesting the resource
  • Referer: the page the request navigated from
  • Cookie: credentials and parameters sent with the request

Header fields can be added or faked as the situation demands; fields you leave out fall back to default values.

Sometimes you can reach a resource without filling in or faking any headers at all. But resources that require special privileges (VIP access, say) usually can only be fetched with a Cookie that carries sufficient permissions.
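For reference, here is a minimal sketch of how a proxy and faked headers are handed to requests; the proxy address and header values below are placeholders I've made up, so substitute live ones from the pool described next:

import requests

# Placeholder proxy in the {protocol: 'ip:port'} shape built later in this post.
# requests matches proxy keys against the URL scheme in lowercase, so
# 'http'/'https' keys are the safe choice.
proxies = {'http': 'http://121.40.109.183:80'}  # made-up address
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',  # placeholder UA
    'Referer': 'https://blog.csdn.net/',
}
response = requests.get('http://www.kxdaili.com/dailiip.html',
                        headers=headers, proxies=proxies, timeout=5)
print(response.status_code)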

Proxy IPs are usually collected from proxy-listing sites. Here is one I recommend:

http://www.kxdaili.com/dailiip.html

With a bit of simple scraping (HTML parsing), you can pull 100 free proxy IPs from this site. Store each one in a list as a dict of the form {protocol: 'ip:port'}, and you have a proxy-IP pool.

import requests
from lxml import etree

proxies_lst = []
for i in range(1, 11):
    ip_url = f'http://www.kxdaili.com/dailiip/1/{i}.html'
    # http://www.kxdaili.com/dailiip/1/2.html
    # http://www.kxdaili.com/dailiip/1/3.html
    response = requests.get(ip_url)
    html = etree.HTML(response.text)
    ip_lst = html.xpath('//div[@class="header-container"]/div[2]/div[2]/div/div[2]/table/tbody/tr')
    for ip_info in ip_lst:
        ip = ip_info.xpath('./td[1]/text()')[0]    # IP address
        port = ip_info.xpath('./td[2]/text()')[0]  # port
        ht = ip_info.xpath('./td[4]/text()')[0]    # protocol (HTTP/HTTPS)
        proxies_info = {
            ht: ip + ':' + port
        }
        proxies_lst.append(proxies_info)

for i in proxies_lst:
    print(i)
print(len(proxies_lst))

Cookies are hard to fake. If a resource gates on a Cookie, use one when you have it; without one you generally can't get in and will need some other approach (I'm still a scraping novice, so I don't have one to offer yet).
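When you do have a valid Cookie (copied from the browser's developer tools while logged in), it rides along in the headers like any other field. A minimal sketch, with the cookie value left as a placeholder:

import requests

headers = {
    # Paste the Cookie header from a logged-in browser session here.
    'cookie': '<your-cookie-string>',
}
response = requests.get('https://blog.csdn.net/', headers=headers)
print(response.status_code)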

Fake the User-Agent and Referer and draw them at random with the random module; the proxy IP is likewise drawn at random from the pool, so the bigger the pool the better (each individual IP gets reused less often):

import random

user_agent_list = [
    'Mozilla/5.0(compatible;MSIE9.0;WindowsNT6.1;Trident/5.0)',
    'Mozilla/4.0(compatible;MSIE8.0;WindowsNT6.0;Trident/4.0)',
    'Mozilla/4.0(compatible;MSIE7.0;WindowsNT6.0)',
    'Opera/9.80(WindowsNT6.1;U;en)Presto/2.8.131Version/11.11',
    'Mozilla/5.0(WindowsNT6.1;rv:2.0.1)Gecko/20100101Firefox/4.0.1',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36'
]
referer_list = [
    'http://blog.csdn.net/dala_da/article/details/79401163',
    'http://blog.csdn.net/',
    'https://www.sogou.com/tx?query=%E4%BD%BF%E7%94%A8%E7%88%AC%E8%99%AB%E5%88%B7csdn%E8%AE%BF%E9%97%AE%E9%87%8F&hdq=sogou-site-706608cfdbcc1886-0001&ekv=2&ie=utf8&cid=qb7.zhuye&',
    'https://www.baidu.com/s?tn=98074231_1_hao_pg&word=%E4%BD%BF%E7%94%A8%E7%88%AC%E8%99%AB%E5%88%B7csdn%E8%AE%BF%E9%97%AE%E9%87%8F'
]
user_agent = random.choice(user_agent_list)
referer = random.choice(referer_list)

2. Regular expressions: re

Python's re module is a key tool for processing string data. Regular-expression syntax is quite involved, so this post won't cover it in depth; we'll only touch on a few re features that come up often in scraping.

When scraping, we very often need to extract part of a string, and especially part of a URL.

A URL typically consists of a scheme (https://), a domain (www.baidu.com), a resource path, and parameters. The path and the parameters frequently contain the string fields we're after, and that's where re's string splitting comes in.
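As an aside, the standard library can already break a URL into exactly these components; a quick illustration with urllib.parse (not used in the code below, just for orientation):

from urllib.parse import urlparse

parts = urlparse('https://blog.csdn.net/phoenixFlyzzz?type=blog')
print(parts.scheme)  # https
print(parts.netloc)  # blog.csdn.net
print(parts.path)    # /phoenixFlyzzz
print(parts.query)   # type=blog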

Example 1: https://blog.csdn.net/phoenixFlyzzz

Extracting the user ID from the URL in Example 1:

import re

url = "https://blog.csdn.net/phoenixFlyzzz"
user_id = re.split("/", url)[3]
print(user_id)
# phoenixFlyzzz

So re.split() splits a string on a delimiter and returns the pieces as a list.
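Printing the whole result list shows why index 3 picks out the user ID:

import re

print(re.split("/", "https://blog.csdn.net/phoenixFlyzzz"))
# ['https:', '', 'blog.csdn.net', 'phoenixFlyzzz']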

Example 2: https://blog.csdn.net/phoenixFlyzzz?type=blog

Extracting the user ID from the URL in Example 2:

import re

url = "https://blog.csdn.net/phoenixFlyzzz?type=blog"
user_id = re.split(r"/|\?", url)[3]
print(user_id)
# phoenixFlyzzz

As this shows, re.split() can split on several delimiters at once. Here the delimiters are / and ?, the | between them is the regex alternation operator, and the backslash escapes the ? because ? is a regex metacharacter and we want the literal question mark.
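A raw string (r'...') keeps the backslash intact on its way to the regex engine and avoids Python's invalid-escape warning; the split result looks like this:

import re

print(re.split(r"/|\?", "https://blog.csdn.net/phoenixFlyzzz?type=blog"))
# ['https:', '', 'blog.csdn.net', 'phoenixFlyzzz', 'type=blog']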

3. Looping access to a site through proxy IPs

When a crawler visits a site repeatedly, never hammer it: over-frequent requests inflate the server's resource costs. Keep the access frequency under control, for example by putting the code to sleep with the time module.
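A minimal throttling sketch (the target list and the 1-3 second delay range are illustrative choices, not fixed rules):

import random
import time

import requests

urls = ['https://blog.csdn.net/']  # stand-in list of targets
for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    time.sleep(random.uniform(1, 3))  # random pause between requests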

(For the record: all code in this post is for learning purposes only and must not be used for any commercial purpose.)

Here is a crawler that cycles through a blog's articles automatically:

import requests
from lxml import etree
import random
import time
import re
import json

user_url = input('Enter the user profile URL: ')
# Collect all of the user's article URLs from the profile link:
# extract user_id from user_url with a regex
user_id = re.split(r"/|\?", user_url)[3]
json_url = f'https://blog.csdn.net/community/home-api/v1/get-business-list?page=1&size=20&businessType=blog&orderby=&noMore=false&year=&month=&username={user_id}'
# Request the JSON payload
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36',
    'referer': user_url,
    'cookie': 'uuid_tt_dd=10_3110927480-1676090223071-792047; __bid_n=1863ec38aea95f6a424207; UN=phoenixFlyzzz; p_uid=U010000; _ga=GA1.2.993941723.1676213175; historyList-new=%5B%5D; Hm_ct_6bcd52f51e9b3dce32bec4a3997715ac=6525*1*10_3110927480-1676090223071-792047!5744*1*phoenixFlyzzz; FPTOKEN=rGJaKVnrAyrd9c6PNrWR621PRkeUFNL5oQN+ZcnMlhc1gi9jUB2f+3Lre4ssgxxkoHCAjPSQg38FYQVulxS85MVFhuGNp4Tj1sDo6/tLmWw+NYhN9elmUgZ6NEC48t5v2yT3LT4H61ZZJyeAvtv55Yd0cn6v3uEN4FoVd0mM1x2hF/Qz68/K5Hf63vIdlfpl+urOIv9VIuQSmABf0uxvOnsxMnMJOZInkuHt8hsy1qna5lTtPF6VWxTUPIC8dvoTqbr67BjcuEi4naB2tLElGXT5TjgnoWsInXpmD6ABYeF630/ex1x49imDOOKTGvYoNrbA4gYKSh3ePcRv1K8FPNuI8oRj1F+4gFTT9dJcgeK3lI4wO+NY0TiAAgWS4k8VpuntN0kYay1eKtUE2En3sA==|lzoBrn2+9F0BmgSIvcEt7t/AAp7YH4Yr0nrG43bNJ48=|10|fd2bfb9200cc0d87abf868edf8f4d31a; dp_token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpZCI6MTk3NjQ1MCwiZXhwIjoxNjg5NTMxMzE2LCJpYXQiOjE2ODg5MjY1MTYsInVzZXJuYW1lIjoicGhvZW5peEZseXp6eiJ9.rg0DgrqX7TQWPJosI-6OKmQtAraxmyBMfg0H0xerRpY; log_Id_view=24395; management_ques=1689227893320; hide_login=1; c_dl_fref=https://so.csdn.net/so/search; c_dl_prid=1689264739921_862614; c_dl_rid=1689264756287_665500; c_dl_fpage=/download/weixin_38722164/13767050; c_dl_um=distribute.pc_search_result.none-task-download-2%7Eall%7Efirst_rank_ecpm_v1%7Erank_v31_ecpm-3-13993802-null-null.142%5Ev88%5Econtrol_2%2C239%5Ev2%5Einsert_chatgpt; loginbox_strategy=%7B%22taskId%22%3A270%2C%22abCheckTime%22%3A1689240353169%2C%22version%22%3A%22notInDomain%22%2C%22blog-sixH-default%22%3A1689265737075%7D; UserName=phoenixFlyzzz; UserInfo=e8f9153e71c94dcabecc0827927e50c5; UserToken=e8f9153e71c94dcabecc0827927e50c5; UserNick=%E5%91%BD%E8%BF%90on-9; AU=D18; BT=1689265829191; Hm_up_6bcd52f51e9b3dce32bec4a3997715ac=%7B%22islogin%22%3A%7B%22value%22%3A%221%22%2C%22scope%22%3A1%7D%2C%22isonline%22%3A%7B%22value%22%3A%221%22%2C%22scope%22%3A1%7D%2C%22isvip%22%3A%7B%22value%22%3A%220%22%2C%22scope%22%3A1%7D%2C%22uid_%22%3A%7B%22value%22%3A%22phoenixFlyzzz%22%2C%22scope%22%3A1%7D%7D; log_Id_pv=3995; log_Id_click=6559; firstDie=1; Hm_lvt_e5ef47b9f471504959267fd614d579cd=1689268533; ssxmod_itna=QqGxgDnQGQ57qYKGHAonx02jRG8KqHYbii1mDlO3xA5D8D6DQeGTb0Y7eb=d1e7DCqfsqYZ2x3QtiA8GhmtCnxPhfmmDB3DEx0=KmCYxiinDCeDIDWeDiDGR7D=xGYDj0F/C9Dm4i7DYqGRDB6UCqDf+qGW7uQDmLNDGup6D7QDIw6g9R2DLeDSK7Ub7qDMUeGXSDa47dRWHpGMITnbWePuKCiDtqD94m=DbfL3x0pyRTrz88hr9OxQmG3Y4rqeY7DImDesQADe4SeYQD+GYGGNS7xj9O44DD3YY01beD===; ssxmod_itna2=QqGxgDnQGQ57qYKGHAonx02jRG8KqHYbii1D61frD0HPe031i70peDy09Dqn4nDkt7ORHokSGi0vxmjCBqhiF1l60OcsTX9M3e1ic/ZEcEBQSlbnEfMopKrUz54r8XGHYIckRuyTyWHEPm7novTcYFbdaYr2AYr/h51QKu73a9p5fENTb9sHRYzSeBAjeBCjB5sUmo10jn7CPTx6eTjqrAEe8Et9pfUtZLTCOSwFIkveM3dxNKhj/7fdPkb04uD1incIipNa=F7X=m1Kw974UDtx6DKq0RN9cdldWU=7DNq/CFzUpPeEf5BYrlD11YiPEsu0YjR=9EoZTxK2bBu=l3GYAbwds9EKAwqMuo1hrkCmLx1srOsmrlkY1oQiW5VYQ6ez6oI9jw+jt/0wRlYZ0wanNXrkUgmRmHTrd4SwObIMOE5uoWqKdAzjGrzEPVg5aqzRuwUQrlWhK2W4S5lMvKrjguYGdE6amV4OnuYspEiOQmWYvDDwc4DjKDewD4D=; c_utm_source=edu_txxl_mh; dc_session_id=10_1689309742332.208593; c_first_ref=default; c_segment=15; Hm_lvt_6bcd52f51e9b3dce32bec4a3997715ac=1688911197,1688917774,1689304257,1689309744; dc_sid=a1dffd08dd905125e95cd269df2ea4bc; FCNEC=%5B%5B%22AKsRol92q1iv8tx72fkK9bOYJMj_ruoB23PUFbGwA9z1pdh2biHzNAYEWChj9ex5C9gx7naL_pBnalXM2c1sI4Z6eFDqouJ775-0J12K75yqXnRA5tCEXkZiuEAZmQkJKkEPP--Di9CH84WWirUA2luc25OT2gWTBA%3D%3D%22%5D%2Cnull%2C%5B%5D%5D; csrfToken=PWrKJ_3MqdFIcAdzeDpS99mD; __gads=ID=be94ab085530c60b-22868fbfd3d900f6:T=1676560572:RT=1689312851:S=ALNI_MYNNxc0dxyRCaKnMGQnAKL5Qppr5g; __gpi=UID=00000bc4df7125c3:T=1676560572:RT=1689312851:S=ALNI_MZVPQ9kZkGSCUXxaL5KbHyGT69GBQ; log_Id_click=6560; c_utm_medium=distribute.pc_feed_blog.none-task-blog-personrec_tag-1-131698929-null-null.nonecase; https_waf_cookie=b23550e2-1410-49c5e754af82b31d803cdb7794d5e2b68935; log_Id_pv=3996; c_pref=default; c_first_page=https%3A//blog.csdn.net/m0_61780496; c_dsid=11_1689314745151.983284; c_ref=https%3A//blog.csdn.net/liusuihong919520/article/details/131698929%3Fspm%3D1001.2100.3001.7377%26utm_medium%3Ddistribute.pc_feed_blog.none-task-blog-personrec_tag-1-131698929-null-null.nonecase%26depth_1-utm_source%3Ddistribute.pc_feed_blog.none-task-blog-personrec_tag-1-131698929-null-null.nonecase; Hm_lpvt_6bcd52f51e9b3dce32bec4a3997715ac=1689315357; c_page_id=default; dc_tos=rxrw3v'
}
json_response = requests.get(json_url, headers=headers)
time.sleep(2)
article_info_lst = []
json_data = json.loads(json_response.text)
article_num = json_data['data']['total']
print(f'article_num={article_num}')
n = article_num // 20 + 1  # the API returns 20 articles per page
try:
    for i in range(n):
        json_url = f'https://blog.csdn.net/community/home-api/v1/get-business-list?page={i+1}&size=20&businessType=blog&orderby=&noMore=false&year=&month=&username={user_id}'
        json_response = requests.get(json_url, headers=headers)
        json_data = json.loads(json_response.text)
        article_lst = json_data['data']['list']
        for article in article_lst:
            article_info_lst.append((article['url'], article['title']))
except Exception as e:
    print(e)
# Build the proxy-IP pool
proxies_lst = []
for i in range(1, 11):
    ip_url = f'http://www.kxdaili.com/dailiip/1/{i}.html'
    # http://www.kxdaili.com/dailiip/1/2.html
    # http://www.kxdaili.com/dailiip/1/3.html
    response = requests.get(ip_url)
    html = etree.HTML(response.text)
    ip_lst = html.xpath('//div[@class="header-container"]/div[2]/div[2]/div/div[2]/table/tbody/tr')
    for ip_info in ip_lst:
        ip = ip_info.xpath('./td[1]/text()')[0]
        port = ip_info.xpath('./td[2]/text()')[0]
        ht = ip_info.xpath('./td[4]/text()')[0]
        proxies_info = {
            ht: ip + ':' + port
        }
        proxies_lst.append(proxies_info)
for i in proxies_lst:
    print(i)
print(len(proxies_lst))
# Fake the browser identity and browsing trail
user_agent_list = [
    'Mozilla/5.0(compatible;MSIE9.0;WindowsNT6.1;Trident/5.0)',
    'Mozilla/4.0(compatible;MSIE8.0;WindowsNT6.0;Trident/4.0)',
    'Mozilla/4.0(compatible;MSIE7.0;WindowsNT6.0)',
    'Opera/9.80(WindowsNT6.1;U;en)Presto/2.8.131Version/11.11',
    'Mozilla/5.0(WindowsNT6.1;rv:2.0.1)Gecko/20100101Firefox/4.0.1',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36'
]
referer_list = [
    'http://blog.csdn.net/dala_da/article/details/79401163',
    'http://blog.csdn.net/',
    'https://www.sogou.com/tx?query=%E4%BD%BF%E7%94%A8%E7%88%AC%E8%99%AB%E5%88%B7csdn%E8%AE%BF%E9%97%AE%E9%87%8F&hdq=sogou-site-706608cfdbcc1886-0001&ekv=2&ie=utf8&cid=qb7.zhuye&',
    'https://www.baidu.com/s?tn=98074231_1_hao_pg&word=%E4%BD%BF%E7%94%A8%E7%88%AC%E8%99%AB%E5%88%B7csdn%E8%AE%BF%E9%97%AE%E9%87%8F'
]
test_num = 1
while True:
    print(f'Round {test_num}')
    test_num += 1
    for article in article_info_lst:
        url = article[0]
        headers = {
            'Referer': random.choice(referer_list),
            'User-Agent': random.choice(user_agent_list)
        }
        pos = random.randint(0, len(proxies_lst) - 1)
        proxies = proxies_lst[pos]  # draw a random proxy from the pool
        try:
            response = requests.get(url, headers=headers, proxies=proxies)
            html = etree.HTML(response.text)
            read_num = html.xpath('//*[@id="mainBox"]/main/div/div/div/div[2]/div/div/span[@class="read-count"]/text()')[0]
        except ValueError:
            break
        else:
            print(f'Status code: {response.status_code}, ', end='')
            if response.status_code == 200:
                print(f'{url} visited successfully, view count now: {read_num}, current ip: {proxies}')
            else:
                print(f'{url} visit failed')
            time.sleep(1)
    time.sleep(10)

4. Access control with selenium

Selenium is a website automation-testing tool that is also frequently used for scraping. It is far slower than requests, though, so whenever requests can fetch a resource directly, there's no reason to use selenium.

In many crawlers selenium plays only a supporting role: by driving a visible browser it lets the programmer watch what the crawler does, which makes the code easier to write and tune.

Between selenium and requests you can easily grab the front-end code, and selenium can also click page controls to change the browser's location, visiting resources once or in a loop (paging through results); a pagination sketch follows below.
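For instance, a pagination loop might keep clicking the '下一页' (next page) link until it disappears; the link text and the example URL here are assumptions, since every site lays this out differently:

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://blog.csdn.net/phoenixFlyzzz?type=blog')  # example listing page
while True:
    try:
        driver.find_element(By.LINK_TEXT, '下一页').click()  # assumed next-page link text
    except Exception:
        break  # no next-page control left, so this was the last page
    time.sleep(2)  # give the next page time to load
driver.quit()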

Once the resource is in hand, it's all data work: parse the HTML or JSON, extract the fields you want, then process them.

Here is a crawler that logs in automatically and batch "triple-likes" (follow, like, comment) a blog's articles:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from lxml import etree
import random
import time
import re
import json
import requests

# Configure a headless browser
opt = Options()
opt.add_argument("--headless")
opt.add_argument("--disable-gpu")
# Open the browser; headless mode is optional
driver = webdriver.Chrome(options=opt)
# driver = webdriver.Chrome()
# Log in
url = "https://passport.csdn.net/login"
driver.get(url)
time.sleep(2)
driver.find_element(By.XPATH, "/html/body/div[2]/div/div[2]/div[2]/div[1]/div/div[1]/span[4]").click()
time.sleep(2)
# Enter your own account and password
id_number = input('Enter your CSDN account: ')
password = input('Enter your CSDN password: ')
driver.find_element(By.XPATH, '/html/body/div[2]/div/div[2]/div[2]/div[1]/div/div[2]/div/div[1]/div/input').send_keys(f'{id_number}')
driver.find_element(By.XPATH, '/html/body/div[2]/div/div[2]/div[2]/div[1]/div/div[2]/div/div[2]/div/input').send_keys(f'{password}')
time.sleep(2)
driver.find_element(By.XPATH, '/html/body/div[2]/div/div[2]/div[2]/div[1]/div/div[2]/div/div[4]/button').click()
time.sleep(2)
# User profile page
user_url = input("Enter the target blogger's profile URL: ")
driver.get(user_url)
time.sleep(2)
# Extract user_id from user_url with a regex
user_id = re.split(r"/|\?", user_url)[3]
json_url = f'https://blog.csdn.net/community/home-api/v1/get-business-list?page=1&size=20&businessType=blog&orderby=&noMore=false&year=&month=&username={user_id}'
# Follow the blogger ('关注' is the follow button's link text)
try:
    driver.find_element(By.LINK_TEXT, '关注').click()
    print(f'Followed {user_id} successfully')
    time.sleep(2)
except Exception:
    print(f'User {user_id} is already followed')
# Request the JSON payload
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36',
    'referer': user_url,
    'cookie': 'uuid_tt_dd=10_3110927480-1676090223071-792047; __bid_n=1863ec38aea95f6a424207; UN=phoenixFlyzzz; p_uid=U010000; _ga=GA1.2.993941723.1676213175; historyList-new=%5B%5D; Hm_ct_6bcd52f51e9b3dce32bec4a3997715ac=6525*1*10_3110927480-1676090223071-792047!5744*1*phoenixFlyzzz; FPTOKEN=rGJaKVnrAyrd9c6PNrWR621PRkeUFNL5oQN+ZcnMlhc1gi9jUB2f+3Lre4ssgxxkoHCAjPSQg38FYQVulxS85MVFhuGNp4Tj1sDo6/tLmWw+NYhN9elmUgZ6NEC48t5v2yT3LT4H61ZZJyeAvtv55Yd0cn6v3uEN4FoVd0mM1x2hF/Qz68/K5Hf63vIdlfpl+urOIv9VIuQSmABf0uxvOnsxMnMJOZInkuHt8hsy1qna5lTtPF6VWxTUPIC8dvoTqbr67BjcuEi4naB2tLElGXT5TjgnoWsInXpmD6ABYeF630/ex1x49imDOOKTGvYoNrbA4gYKSh3ePcRv1K8FPNuI8oRj1F+4gFTT9dJcgeK3lI4wO+NY0TiAAgWS4k8VpuntN0kYay1eKtUE2En3sA==|lzoBrn2+9F0BmgSIvcEt7t/AAp7YH4Yr0nrG43bNJ48=|10|fd2bfb9200cc0d87abf868edf8f4d31a; dp_token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpZCI6MTk3NjQ1MCwiZXhwIjoxNjg5NTMxMzE2LCJpYXQiOjE2ODg5MjY1MTYsInVzZXJuYW1lIjoicGhvZW5peEZseXp6eiJ9.rg0DgrqX7TQWPJosI-6OKmQtAraxmyBMfg0H0xerRpY; log_Id_view=24395; management_ques=1689227893320; hide_login=1; c_dl_fref=https://so.csdn.net/so/search; c_dl_prid=1689264739921_862614; c_dl_rid=1689264756287_665500; c_dl_fpage=/download/weixin_38722164/13767050; c_dl_um=distribute.pc_search_result.none-task-download-2%7Eall%7Efirst_rank_ecpm_v1%7Erank_v31_ecpm-3-13993802-null-null.142%5Ev88%5Econtrol_2%2C239%5Ev2%5Einsert_chatgpt; loginbox_strategy=%7B%22taskId%22%3A270%2C%22abCheckTime%22%3A1689240353169%2C%22version%22%3A%22notInDomain%22%2C%22blog-sixH-default%22%3A1689265737075%7D; UserName=phoenixFlyzzz; UserInfo=e8f9153e71c94dcabecc0827927e50c5; UserToken=e8f9153e71c94dcabecc0827927e50c5; UserNick=%E5%91%BD%E8%BF%90on-9; AU=D18; BT=1689265829191; Hm_up_6bcd52f51e9b3dce32bec4a3997715ac=%7B%22islogin%22%3A%7B%22value%22%3A%221%22%2C%22scope%22%3A1%7D%2C%22isonline%22%3A%7B%22value%22%3A%221%22%2C%22scope%22%3A1%7D%2C%22isvip%22%3A%7B%22value%22%3A%220%22%2C%22scope%22%3A1%7D%2C%22uid_%22%3A%7B%22value%22%3A%22phoenixFlyzzz%22%2C%22scope%22%3A1%7D%7D; log_Id_pv=3995; log_Id_click=6559; firstDie=1; Hm_lvt_e5ef47b9f471504959267fd614d579cd=1689268533; ssxmod_itna=QqGxgDnQGQ57qYKGHAonx02jRG8KqHYbii1mDlO3xA5D8D6DQeGTb0Y7eb=d1e7DCqfsqYZ2x3QtiA8GhmtCnxPhfmmDB3DEx0=KmCYxiinDCeDIDWeDiDGR7D=xGYDj0F/C9Dm4i7DYqGRDB6UCqDf+qGW7uQDmLNDGup6D7QDIw6g9R2DLeDSK7Ub7qDMUeGXSDa47dRWHpGMITnbWePuKCiDtqD94m=DbfL3x0pyRTrz88hr9OxQmG3Y4rqeY7DImDesQADe4SeYQD+GYGGNS7xj9O44DD3YY01beD===; ssxmod_itna2=QqGxgDnQGQ57qYKGHAonx02jRG8KqHYbii1D61frD0HPe031i70peDy09Dqn4nDkt7ORHokSGi0vxmjCBqhiF1l60OcsTX9M3e1ic/ZEcEBQSlbnEfMopKrUz54r8XGHYIckRuyTyWHEPm7novTcYFbdaYr2AYr/h51QKu73a9p5fENTb9sHRYzSeBAjeBCjB5sUmo10jn7CPTx6eTjqrAEe8Et9pfUtZLTCOSwFIkveM3dxNKhj/7fdPkb04uD1incIipNa=F7X=m1Kw974UDtx6DKq0RN9cdldWU=7DNq/CFzUpPeEf5BYrlD11YiPEsu0YjR=9EoZTxK2bBu=l3GYAbwds9EKAwqMuo1hrkCmLx1srOsmrlkY1oQiW5VYQ6ez6oI9jw+jt/0wRlYZ0wanNXrkUgmRmHTrd4SwObIMOE5uoWqKdAzjGrzEPVg5aqzRuwUQrlWhK2W4S5lMvKrjguYGdE6amV4OnuYspEiOQmWYvDDwc4DjKDewD4D=; c_utm_source=edu_txxl_mh; dc_session_id=10_1689309742332.208593; c_first_ref=default; c_segment=15; Hm_lvt_6bcd52f51e9b3dce32bec4a3997715ac=1688911197,1688917774,1689304257,1689309744; dc_sid=a1dffd08dd905125e95cd269df2ea4bc; FCNEC=%5B%5B%22AKsRol92q1iv8tx72fkK9bOYJMj_ruoB23PUFbGwA9z1pdh2biHzNAYEWChj9ex5C9gx7naL_pBnalXM2c1sI4Z6eFDqouJ775-0J12K75yqXnRA5tCEXkZiuEAZmQkJKkEPP--Di9CH84WWirUA2luc25OT2gWTBA%3D%3D%22%5D%2Cnull%2C%5B%5D%5D; csrfToken=PWrKJ_3MqdFIcAdzeDpS99mD; __gads=ID=be94ab085530c60b-22868fbfd3d900f6:T=1676560572:RT=1689312851:S=ALNI_MYNNxc0dxyRCaKnMGQnAKL5Qppr5g; __gpi=UID=00000bc4df7125c3:T=1676560572:RT=1689312851:S=ALNI_MZVPQ9kZkGSCUXxaL5KbHyGT69GBQ; log_Id_click=6560; c_utm_medium=distribute.pc_feed_blog.none-task-blog-personrec_tag-1-131698929-null-null.nonecase; https_waf_cookie=b23550e2-1410-49c5e754af82b31d803cdb7794d5e2b68935; log_Id_pv=3996; c_pref=default; c_first_page=https%3A//blog.csdn.net/m0_61780496; c_dsid=11_1689314745151.983284; c_ref=https%3A//blog.csdn.net/liusuihong919520/article/details/131698929%3Fspm%3D1001.2100.3001.7377%26utm_medium%3Ddistribute.pc_feed_blog.none-task-blog-personrec_tag-1-131698929-null-null.nonecase%26depth_1-utm_source%3Ddistribute.pc_feed_blog.none-task-blog-personrec_tag-1-131698929-null-null.nonecase; Hm_lpvt_6bcd52f51e9b3dce32bec4a3997715ac=1689315357; c_page_id=default; dc_tos=rxrw3v'
}
json_response = requests.get(json_url, headers=headers)
time.sleep(2)
article_info_lst = []
json_data = json.loads(json_response.text)
article_num = json_data['data']['total']
print(f'article_num={article_num}')
n = article_num // 20 + 1
try:
    for i in range(n):
        json_url = f'https://blog.csdn.net/community/home-api/v1/get-business-list?page={i+1}&size=20&businessType=blog&orderby=&noMore=false&year=&month=&username={user_id}'
        json_response = requests.get(json_url, headers=headers)
        json_data = json.loads(json_response.text)
        article_lst = json_data['data']['list']
        for article in article_lst:
            article_info_lst.append((article['url'], article['title']))
except Exception as e:
    print(e)
article_num = 0
# CSDN caps comments at 10 per day
for article_info in article_info_lst:
    article_num += 1
    driver.get(article_info[0])
    time.sleep(3)
    # Scroll the page
    js = 'window.scrollTo(0, 1000)'  # scroll down
    driver.execute_script(js)
    time.sleep(1)
    # Like the post; if it is already liked, skip it (an already-liked post
    # has been commented on too, so skip commenting as well)
    html_data = etree.HTML(driver.page_source)
    flag = html_data.xpath('/html/body/div[3]/div/main/div[2]/div/div[2]/ul/li[1]/a/img[3]/@style')[0]
    if flag == 'display:none':
        print(f'Article {article_num}: {article_info[1]} has already been liked')
        continue
    else:
        driver.find_element(By.XPATH, '/html/body/div[3]/div/main/div[2]/div/div[2]/ul/li[1]').click()
    # Comment (the canned comments stay in Chinese, since they get posted to CSDN)
    content_lst = [
        '博主讲解得太详细了,通俗易懂,优质好文,必须三连支持!!!',
        '感谢博主细致的讲解,让我豁然开朗,非常感谢, 三连支持一波!!!',
        '非常优秀的博文,感谢博主!!!三连奉上!!!',
        '复习打卡冲冲冲,一起加油呀!!!感谢博主的细致讲解',
        '正在学习这方面的知识,这篇博文对我的帮助很大,非常感谢!'
    ]
    # On your own article there is no reward button, so the comment button is
    # the 4th list item; on someone else's article it is the 5th
    # driver.find_element(By.XPATH, '/html/body/div[3]/div/main/div[2]/div/div[2]/ul/li[4]').click()
    driver.find_element(By.XPATH, '/html/body/div[3]/div/main/div[2]/div/div[2]/ul/li[5]').click()
    time.sleep(1)
    driver.find_element(By.XPATH, '//*[@id="comment_content"]').send_keys(random.choice(content_lst))
    time.sleep(1)
    driver.find_element(By.XPATH, '//*[@id="commentform"]/div[2]/div[3]/div[4]/a/input').click()
    time.sleep(2)
    print(f'Article {article_num}: {article_info[1]}, triple-like completed')
