In web scraping, static pages are the easiest to harvest, because all of their data is already present in the page's HTML source. Python's Requests library makes fetching such a page straightforward:
import requests

url = "http://www.santostang.com/"
r = requests.get(url)
print("Text encoding:", r.encoding)
print("Status code:", r.status_code)
print("Response body:", r.text)

Text encoding: UTF-8
Status code: 200
Response body: <!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="UTF-8">
... (remaining head, body, and script markup of the homepage omitted) ...
</html>
Sometimes, to request specific data, we need to put parameters directly into the URL; these key-value pairs follow a question mark. For example, clicking a news item on Baidu leads to a link such as https://baijiahao.baidu.com/s?id=1719021671421019613&wfr=spider&for=pc, where everything after the ? is a set of attribute-value pairs. Requests builds such a query string for us when a dictionary is passed to the params argument:
# Example: request http://httpbin.org/get?key1=value1&key2=value2
import requests

URL = "http://httpbin.org/get"
key_dict = {"key1": "value1", "key2": "value2"}
r = requests.get(URL, params=key_dict)
print("Encoded URL:", r.url)
print("Response body:", r.text)

Encoded URL: http://httpbin.org/get?key1=value1&key2=value2
Response body: {
  "args": {
    "key1": "value1",
    "key2": "value2"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.25.1",
    "X-Amzn-Trace-Id": "Root=1-61b9c935-6c8a0f6c164028ad049e22f7"
  },
  "origin": "115.156.142.199",
  "url": "http://httpbin.org/get?key1=value1&key2=value2"
}
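For reference, the encoding Requests applies to params matches the standard library's urlencode; this offline sketch reproduces the same URL without sending any request:

```python
from urllib.parse import urlencode

# Build the query string by hand, exactly as it appeared in r.url above
query = urlencode({"key1": "value1", "key2": "value2"})
full_url = "http://httpbin.org/get?" + query
print(full_url)  # http://httpbin.org/get?key1=value1&key2=value2
```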
Request headers (Headers) carry information about the request, the response, or other transmitted entities. To see which headers your browser sends:
1. Open the browser's developer tools ("Inspect").
2. Switch to the Network tab.
3. Select a request and open its Headers panel.
For most scrapers, simply adding a browser-style User-Agent is enough:
import requests

URL = "http://www.santostang.com/"
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"}
r = requests.get(URL, headers=headers)
print("Status code:", r.status_code)

Status code: 200
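Why does the User-Agent matter? Without one, Requests announces itself as python-requests, which many sites block. Preparing a request on a Session shows the difference without touching the network (nothing is actually sent):

```python
import requests

session = requests.Session()

# Headers Requests would send by default: the User-Agent is "python-requests/..."
default_req = session.prepare_request(requests.Request("GET", "http://www.santostang.com/"))
print(default_req.headers["User-Agent"])

# The same request with a browser-style User-Agent supplied explicitly
ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
browser_req = session.prepare_request(
    requests.Request("GET", "http://www.santostang.com/", headers={"user-agent": ua})
)
print(browser_req.headers["User-Agent"])  # the Mozilla/5.0 string above
```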
Sometimes a site requires submitting data as a form, which calls for a POST request. With Requests this is just a matter of passing a dictionary to the data parameter of requests.post; the form data then travels in the request body rather than in the URL:
# Example: send form data with POST
import requests

URL = "http://httpbin.org/post"
key_dict = {"key1": "value1", "key2": "value2"}
r = requests.post(URL, data=key_dict)
print("URL:", r.url)
print("Response body:", r.text)

URL: http://httpbin.org/post
Response body: {
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "key1": "value1",
    "key2": "value2"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "23",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.25.1",
    "X-Amzn-Trace-Id": "Root=1-61b9cd28-16dce6c31b3e09c45cf7ef91"
  },
  "json": null,
  "origin": "115.156.142.199",
  "url": "http://httpbin.org/post"
}
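Note that httpbin echoes the pairs under "form" rather than "args": with data= they are sent in the request body, and the URL carries no query string. Preparing the request locally (without sending it) shows exactly what goes on the wire:

```python
import requests

# Prepare (but do not send) a POST request carrying form data
prepared = requests.Request(
    "POST", "http://httpbin.org/post", data={"key1": "value1", "key2": "value2"}
).prepare()
print(prepared.body)                     # key1=value1&key2=value2
print(prepared.headers["Content-Type"])  # application/x-www-form-urlencoded
```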
Sometimes a server takes a very long time to respond, leaving the crawler waiting indefinitely. Requests' timeout parameter sets a deadline: if no response arrives within that many seconds, an exception is raised. Setting an absurdly small value makes the exception easy to demonstrate:
import requests

URL = "http://www.santostang.com/"
r = requests.get(URL, timeout=0.001)  # deliberately far too short, to force a timeout

---------------------------------------------------------------------------
ConnectTimeout                            Traceback (most recent call last)
... (intermediate urllib3/requests frames omitted) ...
ConnectTimeout: HTTPConnectionPool(host='www.santostang.com', port=80): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x000002190B8302E0>, 'Connection to www.santostang.com timed out. (connect timeout=0.001)'))
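In practice a crawler should catch the timeout instead of crashing. The sketch below spins up a deliberately slow local HTTP server (a stand-in for an unresponsive site, so it runs offline) and traps requests.exceptions.Timeout:

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

class SlowHandler(BaseHTTPRequestHandler):
    """Stand-in for an unresponsive site: replies only after a 1-second delay."""

    def do_GET(self):
        time.sleep(1)
        try:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        except OSError:
            pass  # the client gave up before we answered

    def log_message(self, *args):
        pass  # silence per-request logging

server = HTTPServer(("127.0.0.1", 0), SlowHandler)  # port 0 = pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_address[1]

try:
    requests.get(url, timeout=0.2)  # the server needs 1 s, so this times out
    outcome = "responded in time"
except requests.exceptions.Timeout:
    outcome = "request timed out"
print(outcome)  # request timed out
server.shutdown()
```

requests.exceptions.Timeout covers both connect and read timeouts, so one except clause handles either case.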
Turning to the Douban Top 250 film list: each page shows at most 25 films, and every page's URL has the form https://movie.douban.com/top250?start=value, where value runs through 0, 25, 50, ..., 225.
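The sequence of start values and page URLs described above can be generated in one line each:

```python
base = "https://movie.douban.com/top250"
start_values = [25 * i for i in range(10)]              # 0, 25, 50, ..., 225
page_urls = ["%s?start=%d" % (base, v) for v in start_values]
print(page_urls[0])   # https://movie.douban.com/top250?start=0
print(page_urls[-1])  # https://movie.douban.com/top250?start=225
```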
import requests

url = "https://movie.douban.com/top250"
key_dict = {"start": 0}
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"}
for i in range(10):
    key_dict["start"] = 25 * i
    r = requests.get(url, params=key_dict, headers=headers)
    print(r.status_code)
    # save each page's HTML to 0.txt, 1.txt, ..., 9.txt
    path = str(i) + ".txt"
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(r.text)
200
200
200
403
200
200
200
200
200
403
Two of the ten requests came back 403 (Forbidden): Douban throttles rapid repeated requests, so re-running the loop or pausing briefly between pages (for example with time.sleep) usually avoids this. Next, parse the page contents and pull out the film titles:
# Extract the film titles into a list
import requests
from bs4 import BeautifulSoup

url = "https://movie.douban.com/top250"
key_dict = {"start": 0}
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"}
movie_list = []
for i in range(10):
    key_dict["start"] = 25 * i
    r = requests.get(url, params=key_dict, headers=headers)
    print(r.status_code)
    soup = BeautifulSoup(r.text, "lxml")
    # each film's title sits in <div class="hd"><a ...><span>title</span>...</a></div>
    content = soup.find_all("div", class_="hd")
    for each in content:
        movie = each.a.span.text.strip()
        movie_list.append(movie)
print(movie_list)
200
200
200
200
200
200
200
200
200
200
['肖申克的救赎', '霸王别姬', '阿甘正传', '这个杀手不太冷', '泰坦尼克号', '美丽人生', '千与千寻', '辛德勒的名单', '盗梦空间', '忠犬八公的故事', '星际穿越', '楚门的世界', '海上钢琴师', '三傻大闹宝莱坞', '机器人总动员', '放牛班的春天', '无间道', '疯狂动物城', '大话西游之大圣娶亲', '熔炉', '教父', '当幸福来敲门', '控方证人', '龙猫', '怦然心动', '触不可及', '末代皇帝', '蝙蝠侠:黑暗骑士', '寻梦环游记', '活着', '指环王3:王者无敌', '哈利·波特与魔法石', '乱世佳人', '素媛', '飞屋环游记', '摔跤吧!爸爸', '何以为家', '哈尔的移动城堡', '十二怒汉', '我不是药神', '少年派的奇幻漂流', '鬼子来了', '大话西游之月光宝盒', '天空之城', '天堂电影院', '猫鼠游戏', '闻香识女人', '指环王2:双塔奇兵', '罗马假日', '钢琴家', '让子弹飞', '指环王1:护戒使者', '辩护人', '大闹天宫', '教父2', '黑客帝国', '狮子王', '死亡诗社', '海蒂和爷爷', '搏击俱乐部', '绿皮书', '饮食男女', '美丽心灵', '窃听风暴', '本杰明·巴顿奇事', '情书', '两杆大烟枪', '穿条纹睡衣的男孩', '西西里的美丽传说', '看不见的客人', '飞越疯人院', '拯救大兵瑞恩', '音乐之声', '小鞋子', '阿凡达', '海豚湾', '致命魔术', '沉默的羔羊', '哈利·波特与死亡圣器(下)', '美国往事', '禁闭岛', '蝴蝶效应', '布达佩斯大饭店', '心灵捕手', '低俗小说', '春光乍泄', '摩登时代', '七宗罪', '喜剧之王', '致命ID', '被嫌弃的松子的一生', '杀人回忆', '加勒比海盗', '红辣椒', '狩猎', '剪刀手爱德华', '请以你的名字呼唤我', '勇敢的心', '7号房的礼物', '功夫', '超脱', '断背山', '哈利·波特与阿兹卡班的囚徒', '天使爱美丽', '入殓师', '唐伯虎点秋香', '第六感', '幽灵公主', '重庆森林', '小森林 夏秋篇', '阳光灿烂的日子', '爱在黎明破晓前', '一一', '蝙蝠侠:黑暗骑士崛起', '菊次郎的夏天', '哈利·波特与密室', '消失的爱人', '超能陆战队', '无人知晓', '小森林 冬春篇', '完美的世界', '倩女幽魂', '爱在日落黄昏时', '侧耳倾听', '借东西的小人阿莉埃蒂', '甜蜜蜜', '萤火之森', '驯龙高手', '幸福终点站', '玛丽和马克思', '时空恋旅人', '大鱼', '怪兽电力公司', '告白', '阳光姐妹淘', '射雕英雄传之东成西就', '神偷奶爸', '傲慢与偏见', '教父3', '玩具总动员3', '釜山行', '恐怖直播', '一个叫欧维的男人决定去死', '哪吒闹海', '被解救的姜戈', '血战钢锯岭', '未麻的部屋', '头号玩家', '我是山姆', '寄生虫', '七武士', '喜宴', '新世界', '电锯惊魂', '哈利·波特与火焰杯', '模仿游戏', '黑客帝国3:矩阵革命', '花样年华', '卢旺达饭店', '上帝之城', '三块广告牌', '风之谷', '疯狂原始人', '你的名字。', '谍影重重3', '英雄本色', '头脑特工队', '达拉斯买家俱乐部', '纵横四海', '心迷宫', '岁月神偷', '记忆碎片', '惊魂记', '忠犬八公物语', '海街日记', '荒蛮故事', '九品芝麻官', '爆裂鼓手', '贫民窟的百万富翁', '真爱至上', '东邪西毒', '绿里奇迹', '小偷家族', '爱在午夜降临前', '无敌破坏王', '黑天鹅', '冰川时代', '你看起来好像很好吃', '疯狂的石头', '萤火虫之墓', '色,戒', '雨人', '雨中曲', '恐怖游轮', '恋恋笔记本', '魔女宅急便', '2001太空漫游', '城市之光', '可可西里', '虎口脱险', '人工智能', '二十二', '遗愿清单', '初恋这件小事', '海边的曼彻斯特', '大佛普拉斯', '奇迹男孩', '罗生门', '终结者2:审判日', '牯岭街少年杀人事件', '房间', '无间道2', '源代码', '青蛇', '东京教父', '新龙门客栈', '疯狂的麦克斯4:狂暴之路', '魂断蓝桥', '波西米亚狂想曲', '无耻混蛋', '步履不停', '血钻', '茶馆', '彗星来的那一夜', 
'千钧一发', '战争之王', '燃情岁月', '黑客帝国2:重装上阵', '谍影重重2', '崖上的波妞', '背靠背,脸对脸', '海洋', '小丑', '阿飞正传', '穿越时空的少女', '谍影重重', '地球上的星星', '香水', '再次出发之纽约遇见你', '完美陌生人', '我爱你', '爱乐之城', '朗读者', '火星救援', '聚焦', '小萝莉的猴神大叔', '驴得水', '浪潮', '猜火车', '千年女优']
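The each.a.span chain used above can be tried offline on a hand-made snippet that mimics the shape of Douban's markup (this HTML is an illustrative stand-in, not the site's real source; html.parser is used here so no lxml install is needed):

```python
from bs4 import BeautifulSoup

html = """
<div class="hd"><a href="#"><span class="title">肖申克的救赎</span></a></div>
<div class="hd"><a href="#"><span class="title">霸王别姬</span></a></div>
"""
soup = BeautifulSoup(html, "html.parser")
# same navigation as the crawler: div.hd -> first <a> -> first <span>
titles = [div.a.span.text.strip() for div in soup.find_all("div", class_="hd")]
print(titles)  # ['肖申克的救赎', '霸王别姬']
```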