Python Study Notes (21): Common Built-in Modules: contextlib, urllib, HTMLParser

Author: 小丑西瓜9 | 2024-02-22 13:24:30

Built-in modules

contextlib

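When reading or writing a file, you must close it properly once you are done. One way is try...finally; a more convenient way is with open(filename, 'r') as f:

In fact, any object that correctly implements the context management protocol can be used in a with statement. The protocol consists of two methods, __enter__ and __exit__: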

class Query:

    def __init__(self, name):
        self.name = name
    
    def __enter__(self):
        print('Begin')
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        if exc_type:
            print('Error')
        else:
            print('End')
        
    def query(self):
        print('Query info about %s...' % self.name)

with Query('Bob') as q:
    q.query()

Begin
Query info about Bob...
End
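
The run above only exercises the End branch of __exit__. As a minimal sketch of the Error branch, the version below reuses the Query class and raises a simulated exception inside the with block. The say() helper and the log list are hypothetical additions, used only so the order of events can be inspected afterwards.

```python
log = []  # hypothetical helper state: records every message in order


def say(message):
    """Print a message and remember it in log."""
    log.append(message)
    print(message)


class Query:
    def __init__(self, name):
        self.name = name

    def __enter__(self):
        say('Begin')
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # exc_type is None when the with body finished without an exception.
        if exc_type:
            say('Error')
        else:
            say('End')
        # Returning None (falsy) lets any exception propagate to the caller.

    def query(self):
        say('Query info about %s...' % self.name)


try:
    with Query('Bob') as q:
        q.query()
        raise ValueError('simulated failure')  # illustrative error only
except ValueError as e:
    say('Caught: %s' % e)
```

Because __exit__ returns a falsy value, the exception still reaches the surrounding try block; returning True instead would suppress it.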

@contextmanager

Writing __enter__ and __exit__ by hand is still tedious, so contextlib offers a simpler way:

from contextlib import contextmanager

class Query:
    def __init__(self, name):
        self.name = name
    
    def query(self):
        print('Query info about %s...' % self.name)

@contextmanager
def create_query(name):
    print('Begin')
    q = Query(name)
    yield q
    print('End')

with create_query('Bob') as q:
    q.query()

Begin
Query info about Bob...
End

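The @contextmanager decorator takes a generator function. The value produced by yield is bound to var in with ... as var, and the with statement then works as usual.

Often you want some code to run automatically before and after a block. @contextmanager handles that too: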

@contextmanager
def tag(name):
    print("<%s>" % name)
    yield
    print("</%s>" % name)

with tag("h1"):
    print("hello")
    print("world")
    
<h1>
hello
world
</h1>

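The code runs in this order:

The with statement first runs the statements before yield, printing <h1>.
At yield, control passes to the body of the with block, which prints hello and world.
Finally, the statements after yield run, printing </h1>.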
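
One caveat about this order: if the with body raises, the statements after a bare yield are skipped. The sketch below is one possible variant of tag() that wraps the yield in try/finally so the closing step always runs. The steps list is a hypothetical stand-in for print, used so the order can be checked.

```python
from contextlib import contextmanager

steps = []  # hypothetical log used instead of print so the order can be checked


@contextmanager
def tag(name):
    steps.append('<%s>' % name)        # runs when the with block is entered
    try:
        yield                          # the with body runs at this point
    finally:
        steps.append('</%s>' % name)   # runs on exit, even after an exception


with tag('h1'):
    steps.append('hello')

try:
    with tag('p'):
        raise RuntimeError('boom')     # simulated failure inside the body
except RuntimeError:
    steps.append('error propagated')

print(steps)
```

The closing tag for p is still recorded before the exception reaches the except clause.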

@closing

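from contextlib import contextmanager
from urllib.request import urlopen

# contextlib already provides closing(); it is re-implemented here to show how it works.
@contextmanager
def closing(thing):
    try:
        yield thing
    finally:
        thing.close()  # always close the object, even if the with body raises

with closing(urlopen('https://www.python.org')) as page:
    for line in page:
        pass  # replace with print(line) to print every line of the page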
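
closing() is useful for any object that has a close() method but does not implement the context management protocol itself. Here is a small offline sketch using the standard contextlib.closing; the Connection class is hypothetical and exists only for illustration.

```python
from contextlib import closing


class Connection:
    """Hypothetical resource: has close() but no __enter__/__exit__."""

    def __init__(self):
        self.closed = False

    def send(self, message):
        return 'sent: ' + message

    def close(self):
        self.closed = True


conn = Connection()
with closing(conn) as c:       # c is the same object as conn
    reply = c.send('ping')

print(reply, conn.closed)
```

After the with block ends, close() has been called even though Connection knows nothing about with.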

urllib

GET

The request module of urllib makes it easy to fetch the content of a URL: it sends a GET request to the given page and returns the HTTP response.

from urllib import request
with request.urlopen('https://www.python.org') as f:
    data = f.read()
    print('Status:',f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    # print('Data', data.decode('utf-8'))

Status: 200 OK
Server: nginx
Content-Type: text/html; charset=utf-8
X-Frame-Options: DENY
Via: 1.1 vegur
Via: 1.1 varnish
Content-Length: 49123
Accept-Ranges: bytes
Date: Tue, 05 May 2020 18:46:26 GMT
Via: 1.1 varnish
Age: 769
Connection: close
X-Served-By: cache-bwi5122-BWI, cache-mdw17348-MDW
X-Cache: HIT, HIT
X-Cache-Hits: 4, 2
X-Timer: S1588704387.816773,VS0,VE0
Vary: Cookie
Strict-Transport-Security: max-age=63072000; includeSubDomains

To make a GET request look like it comes from a browser, use a Request object and add HTTP headers to it. For example, to request the Douban home page as an iPhone 6:

from urllib import request

req = request.Request('http://www.douban.com/')
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
with request.urlopen(req) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))

Status: 200 OK
Date: Tue, 05 May 2020 18:49:12 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
Vary: Accept-Encoding
X-Xss-Protection: 1; mode=block
X-Douban-Mobileapp: 0
Expires: Sun, 1 Jan 2006 01:00:00 GMT
Pragma: no-cache
Cache-Control: must-revalidate, no-cache, private
Set-Cookie: bid=0uXd--_Fj8w; Expires=Wed, 05-May-21 18:49:12 GMT; Domain=.douban.com; Path=/
X-DOUBAN-NEWBID: 0uXd--_Fj8w
X-DAE-App: talion
X-DAE-Instance: default
Server: dae
Strict-Transport-Security: max-age=15552000
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
Data: 

<!DOCTYPE html>
<html itemscope itemtype="http://schema.org/WebPage" class="ua-safari ua-mobile ">
  <head>
      <meta charset="UTF-8">
      <title>豆瓣(手机版)</title>
      <meta name="google-site-verification" content="ok0wCgT20tBBgo9_zat2iAcimtN4Ftf5ccsh092Xeyw" />
      <meta name="viewport" content="width=device-width, height=device-height, user-scalable=no, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0">
     ...

POST

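To send a POST request, pass the data parameter as bytes.

The example below simulates a Weibo login: it reads the login email and password, then sends them encoded as username=xxx&password=xxx, the format that the weibo.cn login page expects.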

from urllib import request, parse
print('Login to weibo.cn')

email = input('Email: ')
passwd = input('Password: ')
# Encode a dict or sequence of two-element tuples into a URL query string.
login_data = parse.urlencode([
    ('username', email),
    ('password', passwd),
    ('entry', 'mweibo'),
    ('client_id', ''),
    ('savestate', 1),
    ('ec', ''),
    ('pagerefer', 'https://passport.weibo.cn/signin/welcome?entry=mweibo&r=http%3A%2F%2Fm.weibo.cn%2F')
])

req = request.Request('https://passport.weibo.cn/sso/login')
req.add_header('Origin', 'https://passport.weibo.cn')
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
req.add_header('Referer', 'https://passport.weibo.cn/signin/login?entry=mweibo&res=wel&wm=3349&r=http%3A%2F%2Fm.weibo.cn%2F')

with request.urlopen(req, data=login_data.encode('utf-8')) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))

Login to weibo.cn
Status: 200 OK
Server: nginx/1.6.1
Date: Tue, 05 May 2020 18:57:36 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
Cache-Control: no-cache, must-revalidate
Expires: Sat, 26 Jul 1997 05:00:00 GMT
Pragma: no-cache
Access-Control-Allow-Origin: https://passport.weibo.cn
Access-Control-Allow-Credentials: true
DPOOL_HEADER: localhost.localdomain
Data: {"retcode":50011007,"msg":"\u8bf7\u8f93\u5165\u7528\u6237\u540d","data":{"errline":320}}

The login fails: the msg field in the response decodes to 请输入用户名 ("Please enter a username").
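
To check what a POST request will carry without contacting any server, you can build the Request object and inspect it. This is a minimal sketch; the URL and credentials are placeholders, not real endpoints or accounts.

```python
from urllib import parse, request

# Placeholder form fields; urlencode percent-encodes reserved characters such as '@'.
form = parse.urlencode([
    ('username', 'alice@example.com'),
    ('password', 's3cret'),
])
body = form.encode('utf-8')  # the data argument must be bytes

# Placeholder URL; nothing is sent until urlopen() is called.
post_req = request.Request('https://example.com/login', data=body)
get_req = request.Request('https://example.com/')

print(form)
print(post_req.get_method())  # a request that has data defaults to POST
print(get_req.get_method())   # a request without data defaults to GET
```

Supplying data, either to Request or to urlopen as in the Weibo example, is what turns the request from a GET into a POST.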

Handler

For more elaborate control, such as accessing a site through a proxy, use ProxyHandler.
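
As a rough sketch of how that can look, the snippet below builds an opener around a ProxyHandler. The proxy address is a made-up placeholder, and the actual request is left commented out so the snippet runs without network access.

```python
from urllib import request

# Hypothetical proxy address; replace it with a proxy you actually control.
proxy_handler = request.ProxyHandler({'http': 'http://proxy.example.com:3128/'})

# build_opener() returns an OpenerDirector that uses the given handler.
opener = request.build_opener(proxy_handler)

# With a reachable proxy, http:// requests made through this opener
# would be routed via the proxy, for example:
# with opener.open('http://www.example.com/') as f:
#     print('Status:', f.status, f.reason)

print(proxy_handler.proxies)
```

Calling request.install_opener(opener) would make plain urlopen() calls use the same handler, which is convenient when existing code should not be changed.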

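Summary

urllib lets a program issue all kinds of HTTP requests. To imitate a browser for a particular task, make the request look like one the browser would send: watch the requests the browser makes, then copy its request headers. The User-Agent header is the one that identifies the browser.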


XML

from xml.parsers.expat import ParserCreate

class DefaultSaxHandler(object):
    def start_element(self, name, attrs):
        print('sax:start_element: %s, attrs: %s' % (name, str(attrs)))

    def end_element(self, name):
        print('sax:end_element: %s' % name)

    def char_data(self, text):
        print('sax:char_data: %s' % text)

xml = r'''<?xml version="1.0"?>
<ol>
    <li><a href="/python">Python</a></li>
    <li><a href="/ruby">Ruby</a></li>
</ol>
'''

handler = DefaultSaxHandler()
parser = ParserCreate()
parser.StartElementHandler = handler.start_element
parser.EndElementHandler = handler.end_element
parser.CharacterDataHandler = handler.char_data
parser.Parse(xml)

sax:start_element: ol, attrs: {}
sax:char_data: 

sax:char_data:     
sax:start_element: li, attrs: {}
sax:start_element: a, attrs: {'href': '/python'}
sax:char_data: Python
sax:end_element: a
sax:end_element: li
sax:char_data: 

sax:char_data:     
sax:start_element: li, attrs: {}
sax:start_element: a, attrs: {'href': '/ruby'}
sax:char_data: Ruby
sax:end_element: a
sax:end_element: li
sax:char_data: 

sax:end_element: ol
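
Printing every event is useful for seeing how the SAX callbacks fire, but in practice a handler usually keeps state and collects results. The sketch below is one possible way to gather each link's href and text from the same XML; the LinkCollector class is a hypothetical name introduced for this example.

```python
from xml.parsers.expat import ParserCreate


class LinkCollector:
    """Hypothetical handler that collects (href, text) pairs from <a> elements."""

    def __init__(self):
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def start_element(self, name, attrs):
        if name == 'a':
            self._href = attrs.get('href')
            self._text = []

    def char_data(self, text):
        # Text may arrive in several pieces, so accumulate it.
        if self._href is not None:
            self._text.append(text)

    def end_element(self, name):
        if name == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text)))
            self._href = None


xml = '''<?xml version="1.0"?>
<ol>
    <li><a href="/python">Python</a></li>
    <li><a href="/ruby">Ruby</a></li>
</ol>
'''

collector = LinkCollector()
parser = ParserCreate()
parser.StartElementHandler = collector.start_element
parser.EndElementHandler = collector.end_element
parser.CharacterDataHandler = collector.char_data
parser.Parse(xml, True)  # True marks this as the final chunk of data

print(collector.links)
```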

HTMLParser

from html.parser import HTMLParser
from html.entities import name2codepoint

class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        print('<%s>' % tag)

    def handle_endtag(self, tag):
        print('</%s>' % tag)

    def handle_startendtag(self, tag, attrs):
        print('<%s/>' % tag)

    def handle_data(self, data):
        print(data)

    def handle_comment(self, data):
        print('<!--', data, '-->')

    def handle_entityref(self, name):
        print('&%s;' % name)

    def handle_charref(self, name):
        print('&#%s;' % name)

parser = MyHTMLParser()
parser.feed('''<html>
<head></head>
<body>
<!-- test html parser -->
    <p>Some <a href=\"#\">html</a> HTML&nbsp;tutorial...<br>END</p>
</body></html>''')

<html>


<head>
</head>


<body>


<!--  test html parser  -->

    
<p>
Some 
<a>
html
</a>
 HTML tutorial...
<br>
END
</p>


</body>
</html>
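
The handlers above ignore the attrs argument, which is why the link prints as a bare <a>. HTMLParser passes attrs as a list of (name, value) tuples. A minimal sketch that uses it to collect link targets follows; the LinkParser class and the sample HTML string are made up for this example.

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Hypothetical parser that records the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples, e.g. [('href', '#')]
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append(href)


parser = LinkParser()
parser.feed('<p>Some <a href="#">html</a> and <a href="/python">Python</a><br>END</p>')
parser.close()

print(parser.links)
```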

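Exercise

Pick a web page, for example https://www.python.org/events/python-events/, view its source in a browser and copy it, then try parsing the HTML to print the date, name, and location of each conference listed on the Python website.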

from html.parser import HTMLParser
from urllib.request import urlopen

class EventHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tag = None  # label for the text of the element currently being read

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if ('class', 'event-title') in attrs:
            self.tag = 'Event-title'
        if tag == 'time':
            self.tag = 'Time'
        if ('class', 'say-no-more') in attrs:
            self.tag = 'Year'
        elif ('class', 'event-location') in attrs:
            self.tag = 'Event-location'

    def handle_data(self, data):
        if self.tag:
            print(self.tag, data)

    def handle_endtag(self, tag):
        # Stop labelling text as soon as any element closes.
        self.tag = None


with urlopen('https://www.python.org/events/python-events') as f:
    # Decode the bytes as UTF-8; str(f.read()) would give the "b'...'" repr with escaped bytes.
    html_data = f.read().decode('utf-8')

parser = EventHTMLParser()
parser.feed(html_data)

Event-title PyConWeb 2020 (canceled)
Time 09 May – 10 May 
Year  2020
Event-location Munich, Germany
Event-title Django Girls Groningen
Time 16 May
Year  2020
Event-location Groningen, Netherlands
Event-title PyLondinium 2020 (postponed)
Time 05 June – 07 June 
Year  2020
Event-location London, UK
Event-title PyCon CZ 2020 (canceled)
Time 05 June – 07 June 
Year  2020
Event-location Ostrava, Czech Republic
Event-title PyCon Odessa 2020
Time 13 June – 14 June 
Year  2020
Event-location Odessa, Ukraine
Event-title Python Web Conference 2020 (Online-Worldwide)
Time 17 June – 19 June 
Year  2020
Event-location https://2020.pythonwebconf.com
Event-title Python Meeting Düsseldorf
Time 01 July
Year  2020
Event-location Düsseldorf, Germany
Event-title SciPy 2020
Time 06 July – 12 July 
Year  2020
Event-location Online
Event-title Python Nordeste 2020
Time 17 July – 19 July 
Year  2020
Event-location Fortaleza, Ceará, Brasil
Event-title EuroPython 2020 (in-person: canceled, considering going virtual)
Time 20 July – 26 July 
Year  2020
Event-location https://blog.europython.eu/post/612826526375919616/europython-2020-going-virtual-europython-2021
Event-title EuroPython 2020 Online
Time 23 July – 26 July 
Year  2020
Event-location Online Event
Event-title EuroSciPy 2020 (canceled)
Time 27 July – 31 July 
Year  2020
Event-location Bilbao, Spain
Event-title PyCon JP 2020
Time 28 Aug. – 29 Aug. 
Year  2020
Event-location Tokyo, Japan
Event-title PyCon TW 2020
Time 05 Sept. – 06 Sept. 
Year  2020
Event-location International Conference Hall ,No.1, University Road, Tainan City 701, Taiwan
Event-title PyCon SK 2020
Time 11 Sept. – 13 Sept. 
Year  2020
Event-location Bratislava, Slovakia
Event-title DjangoCon Europe 2020
Time 16 Sept. – 20 Sept. 
Year  2020
Event-location Porto, Portugal
Event-title DragonPy 2020
Time 19 Sept. – 20 Sept. 
Year  2020
Event-location Ljubljana, Slovenia
Event-title PyCon APAC 2020
Time 19 Sept. – 20 Sept. 
Year  2020
Event-location Kota Kinabalu, Sabah, Malaysia
Event-title Django Day Copenhagen
Time 25 Sept.
Year  2020
Event-location Copenhagen, Denmark
Event-title PyCon Turkey
Time 26 Sept. – 27 Sept. 
Year  2020
Event-location Albert Long Hall, at Bogazici University Istanbul
Event-title Python Meeting Düsseldorf
Time 30 Sept.
Year  2020
Event-location Düsseldorf, Germany
Event-title PyCon India 2020
Time 02 Oct. – 05 Oct. 
Year  2020
Event-location Bangalore, India
Event-title PyConDE & PyData Berlin 2020
Time 14 Oct. – 16 Oct. 
Year  2020
Event-location Berlin, Germany
Event-title Swiss Python Summit
Time 23 Oct.
Year  2020
Event-location Rapperswil, Switzerland
Event-title PyCC Meetup'19 (Python Cape Coast User Group)
Time 26 Oct.
Year  2020
Event-location Cape coast, Ghana
Event-title Python Brasil 2020
Time 28 Oct. – 02 Nov. 
Year  2020
Event-location Caxias do Sul, RS, Brazil
Event-title PyData London 2020
Time 30 Oct. – 01 Nov. 
Year  2020
Event-location London, UK
Event-title PyCon Italia 2020
Time 05 Nov. – 08 Nov. 
Year  2020
Event-location Florence, Italy
Event-title enterPy
Time 23 Nov. – 24 Nov. 
Year  2020
Event-location Mannheim, Germany
Event-title PyCon US 2021
Time 12 May – 20 May 
Year  2021
Event-location Pittsburgh, PA, USA
Event-title SciPy 2021
Time 12 July – 18 July 
Year  2021
Event-location Austin, TX, US
Event-title EuroPython 2021
Time 26 July – 01 Aug. 
Year  2021
Event-location Dublin, Ireland
Year General
Year Initiatives

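Disclaimer: This article was contributed by a user and does not represent the position of 【wpsshop博客】. Copyright belongs to the original author, and this site assumes no legal liability. If you find infringing content, please contact us. When reposting, please cite the source: https://www.wpsshop.cn/w/小丑西瓜9/article/detail/130282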