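This is my first blog post and my first time learning web scraping. I just want to share what I ran into, so please bear with me.
First, I set up my own User-Agent pool instead of installing the fake-useragent package (pip install fake-useragent) and pulling a random value from it with something like print(ua.ie).
The program further down is the version after I fixed the first error. What I wrote the first time was: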
- ua={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:65.0) Gecko/20100101 Firefox/65.0",
- "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:65.0) Gecko/20100101 Firefox/65.0"
- }
- url = 'http://www.baidu.com/'
- headers = ua_info.a
- req = request.Request(url=url, headers=headers)
- res = urllib.request.urlopen(req)
- #html = res.read().decode('utf-8')
- print(html)
The first problem I ran into:
- Traceback (most recent call last):
- File "C:\Programs\Python\pythonProject\main.py", line 25, in <module>
- req = request.Request(url=url, headers=headers)
- File "C:\Programs\Python\Python39\lib\urllib\request.py", line 326, in __init__
- for key, value in headers.items():
- AttributeError: 'str' object has no attribute 'items'
-
- Process finished with exit code 1
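The traceback comes from the Request constructor calling headers.items(), so headers has to be a dict-like mapping rather than a bare string. A minimal sketch of the difference, assuming a placeholder User-Agent value (nothing is actually sent over the network here):

```python
from urllib import request

url = 'http://www.baidu.com/'
ua = 'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1'

# Wrong: a plain string has no .items(), so Request raises AttributeError
try:
    request.Request(url=url, headers=ua)
    error = None
except AttributeError as exc:
    error = str(exc)
print(error)

# Right: wrap the string in a dict keyed by the header name
req = request.Request(url=url, headers={'User-Agent': ua})
# Request stores header names capitalized, hence 'User-agent'
print(req.get_header('User-agent'))
```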
The program after fixing the first problem:
- import random
- import urllib.request
- from urllib import parse, request
-
- # Custom pool of User-Agent strings
- ua_list = [
- 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
- 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
- 'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
- 'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
- 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
- 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
- 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
- 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
- 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
- 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
- ]
- # Pick one User-Agent at random (the original read this value as ua_info.a)
- a = random.choice(ua_list)
- print(a)
- url = 'http://www.baidu.com/'
- rs1 = a
- # headers must be a dict, not a bare string
- headers = {'User-Agent': rs1}
-
- # 1. Create the request object, wrapping the UA info
- # req = request.Request(url=url, headers=headers)
-
- query_string = {
- 'wd': '爬虫'
- }
- result = parse.urlencode(query_string)
- url1 = 'http://www.baidu.com/s?{}'.format(result)
- req = request.Request(url=url1, headers=headers)
- res = urllib.request.urlopen(req)
- html = res.read().decode('utf-8')
- print(html)
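As a side note on the query string: parse.urlencode percent-encodes the Chinese keyword as UTF-8 bytes before it goes into the URL. A small offline sketch of that step, where the helper name build_search_url is my own and not part of urllib:

```python
from urllib import parse

def build_search_url(keyword):
    """Return a Baidu search URL with the keyword percent-encoded."""
    query = parse.urlencode({'wd': keyword})
    return 'http://www.baidu.com/s?{}'.format(query)

search_url = build_search_url('爬虫')
print(search_url)

# Decoding the query string gives the original keyword back
decoded = parse.parse_qs(parse.urlsplit(search_url).query)
print(decoded)
```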
After about five runs, the response turned into the following:
- <!DOCTYPE html>
- <html lang="zh-CN">
- <head>
- <meta charset="utf-8">
- <title>百度安全验证</title>
- <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
- <meta name="apple-mobile-web-app-capable" content="yes">
- <meta name="apple-mobile-web-app-status-bar-style" content="black">
- <meta name="viewport" content="width=device-width, user-scalable=no, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0">
- <meta name="format-detection" content="telephone=no, email=no">
- <link rel="shortcut icon" href="https://www.baidu.com/favicon.ico" type="image/x-icon">
- <link rel="icon" sizes="any" mask href="https://www.baidu.com/img/baidu.svg">
- <meta http-equiv="X-UA-Compatible" content="IE=Edge">
- <meta http-equiv="Content-Security-Policy" content="upgrade-insecure-requests">
- <link rel="stylesheet" href="https://wappass.bdimg.com/static/touch/css/api/mkdjump_0635445.css" />
- </head>
- <body>
- <div class="timeout hide">
- <div class="timeout-img"></div>
- <div class="timeout-title">网络不给力,请稍后重试</div>
- <button type="button" class="timeout-button">返回首页</button>
- </div>
- <div class="timeout-feedback hide">
- <div class="timeout-feedback-icon"></div>
- <p class="timeout-feedback-title">问题反馈</p>
- </div>
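The script prints whatever comes back, so a verification page like this can easily be mistaken for real results. One way to notice it is to check the page title. This is only a rough sketch: the function name is made up, the two HTML samples are shortened stand-ins, and it assumes Baidu keeps using the 百度安全验证 title seen in the output above.

```python
import re

# Title of the verification page shown in the output above
VERIFY_TITLE = '百度安全验证'

def is_verification_page(html):
    """Return True if the HTML looks like Baidu's security check page."""
    m = re.search(r'<title>(.*?)</title>', html, re.S)
    return bool(m) and m.group(1).strip() == VERIFY_TITLE

# Shortened stand-in pages for illustration
blocked = '<html><head><title>百度安全验证</title></head></html>'
normal = '<html><head><title>爬虫_百度搜索</title></head></html>'

print(is_verification_page(blocked))
print(is_verification_page(normal))
```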
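Searching Baidu turned up a fix: add one more parameter to the headers (the answer also explained where to find its value). With that change the problem went away: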
- headers = {
- 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36 Edg/83.0.478.50',
- 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'
- }
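To keep a random User-Agent from the pool while also sending the Accept header, the two can be combined into one dict per request. A sketch under assumptions: make_headers is a hypothetical helper, the pool is shortened to two entries, and whether Baidu accepts the older User-Agent strings together with this Accept value is not tested here (no request is sent).

```python
import random
from urllib import request

# Shortened pool, for illustration only
UA_POOL = [
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
]
ACCEPT = ('text/html,application/xhtml+xml,application/xml;q=0.9,'
          'image/webp,image/apng,*/*;q=0.8,'
          'application/signed-exchange;v=b3;q=0.9')

def make_headers():
    """Build a headers dict with a random User-Agent plus Accept."""
    return {'User-Agent': random.choice(UA_POOL), 'Accept': ACCEPT}

headers = make_headers()
req = request.Request(url='http://www.baidu.com/s?wd=test', headers=headers)
# Show the headers that would be sent with this request
print(req.header_items())
```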
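Out of curiosity, I also read up on the back-and-forth between crawlers and anti-crawling measures.
I tried the code below as well. It works too, but it produces a warning:
- headers = {'User-Agent': 'Baiduspider'}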