Reading code these past few days, I kept running into asynchronous programming. Until now I had only cared about getting features to work and never thought about how fast the code actually runs, so I decided to study the topic properly.
From 0 to 1: the evolution of asynchronous programming in Python
1. Crawlers with urllib and requests
requests optimizes how HTTP requests are made, so it is a bit faster than urllib.
Requests is an HTTP client library for Python that makes network requests far more intuitive and convenient. Its biggest difference from urllib when crawling is how the connection is handled: urllib closes the connection as soon as the data has been fetched, while requests can keep reusing the socket afterwards instead of disconnecting.
Under Python 2.7 the urllib module came in two parts, urllib and urllib2; in Python 3 the two were merged into a single new urllib package.
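The connection reuse is easiest to see with requests.Session, which keeps the TCP connection to the host alive across calls; a minimal sketch:

import requests

# a Session reuses the underlying TCP connection (HTTP keep-alive),
# so repeated requests to the same host skip the connection setup
session = requests.Session()
for _ in range(3):
    response = session.get('https://movie.douban.com/top250')
    print(response.status_code)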
urllib:
#-*- coding:utf-8 -*-
import urllib.request
import ssl
from lxml import etree

url = 'https://movie.douban.com/top250'
context = ssl.SSLContext(ssl.PROTOCOL_TLSv1_1)


def fetch_page(url):
    response = urllib.request.urlopen(url, context=context)
    return response


def parse(url):
    response = fetch_page(url)
    page = response.read()
    html = etree.HTML(page)

    xpath_movie = '//*[@id="content"]/div/div[1]/ol/li'
    xpath_title = './/span[@class="title"]'
    xpath_pages = '//*[@id="content"]/div/div[1]/div[2]/a'

    pages = html.xpath(xpath_pages)
    fetch_list = []
    result = []

    # movies on the first page
    for element_movie in html.xpath(xpath_movie):
        result.append(element_movie)

    # build the URLs of the remaining pages
    for p in pages:
        fetch_list.append(url + p.get('href'))

    # fetch and parse the remaining pages one by one, sequentially
    for url in fetch_list:
        response = fetch_page(url)
        page = response.read()
        html = etree.HTML(page)
        for element_movie in html.xpath(xpath_movie):
            result.append(element_movie)

    for i, movie in enumerate(result, 1):
        title = movie.find(xpath_title).text
        print(i, title)


def main():
    parse(url)


if __name__ == '__main__':
    main()
Replacing the standard-library urllib with requests:
import requests
from lxml import etree
from time import time

url = 'https://movie.douban.com/top250'


def fetch_page(url):
    response = requests.get(url)
    return response


def parse(url):
    response = fetch_page(url)
    page = response.content
    html = etree.HTML(page)

    xpath_movie = '//*[@id="content"]/div/div[1]/ol/li'
    xpath_title = './/span[@class="title"]'
    xpath_pages = '//*[@id="content"]/div/div[1]/div[2]/a'

    pages = html.xpath(xpath_pages)
    fetch_list = []
    result = []

    for element_movie in html.xpath(xpath_movie):
        result.append(element_movie)

    for p in pages:
        fetch_list.append(url + p.get('href'))

    for url in fetch_list:
        response = fetch_page(url)
        page = response.content
        html = etree.HTML(page)
        for element_movie in html.xpath(xpath_movie):
            result.append(element_movie)

    for i, movie in enumerate(result, 1):
        title = movie.find(xpath_title).text
        # print(i, title)


def main():
    start = time()
    for _ in range(5):  # average over five runs
        parse(url)
    end = time()
    print('Cost {} seconds'.format((end - start) / 5))


if __name__ == '__main__':
    main()
2. Parsing with lxml versus regular expressions
Parsing with lxml takes a certain amount of time, but a program that leans on regular expressions instead ends up harder to maintain and less extensible.
A common combination is Requests + BeautifulSoup (a library for parsing web documents); regular expressions and XPath are the other usual parsing tools.
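For comparison, the same title extraction with Requests + BeautifulSoup might look like the sketch below (bs4 is not used elsewhere in this article):

import requests
from bs4 import BeautifulSoup

page = requests.get('https://movie.douban.com/top250').content
soup = BeautifulSoup(page, 'html.parser')
# Douban wraps each title in <span class="title">
for span in soup.find_all('span', class_='title'):
    print(span.get_text())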
Swapping the lxml library for the standard re library:
#-*- coding:utf-8 -*-
import requests
from time import time
import re

url = 'https://movie.douban.com/top250'


def fetch_page(url):
    response = requests.get(url)
    return response


def parse(url):
    response = fetch_page(url)
    page = response.content

    fetch_list = set()
    result = []

    # matches the same <span class="title"> nodes as xpath_title above
    for title in re.findall(rb'<span class="title">([^<&]*)</span>', page):
        result.append(title)

    # pagination links look like ?start=25&filter=
    for postfix in re.findall(rb'<a href="(\?start=[^"]*)"', page):
        fetch_list.add(url + postfix.decode())

    for url in fetch_list:
        response = fetch_page(url)
        page = response.content
        for title in re.findall(rb'<span class="title">([^<&]*)</span>', page):
            result.append(title)

    for i, title in enumerate(result, 1):
        title = title.decode()
        # print(i, title)
For network applications (like the crawler above), the bottleneck is usually in the I/O layer; removing the waiting on reads and writes pays off far more than speeding up the text parsing.
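A quick way to check this claim is to time the fetch and the parse separately; a rough sketch:

from time import time
import requests
from lxml import etree

start = time()
page = requests.get('https://movie.douban.com/top250').content
print('fetch: {:.3f}s'.format(time() - start))

start = time()
titles = etree.HTML(page).xpath('//span[@class="title"]')
print('parse: {:.3f}s'.format(time() - start))
# the network fetch normally dwarfs the local parse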
3. The difference between threads and processes: threads generally execute concurrently, and it is precisely this concurrency and data-sharing mechanism that makes cooperation between tasks possible; processes generally execute in parallel, and this parallelism lets a program run on multiple CPUs at the same time.
A multithreaded version: the remaining pages are each downloaded by a worker thread, so the waits on the network overlap instead of adding up:

from threading import Thread
import requests
from lxml import etree

url = 'https://movie.douban.com/top250'
xpath_movie = '//*[@id="content"]/div/div[1]/ol/li'
xpath_title = './/span[@class="title"]'
xpath_pages = '//*[@id="content"]/div/div[1]/div[2]/a'
pages_content = []  # filled by the worker threads; list.append is atomic


def fetch_content(url):
    pages_content.append(requests.get(url).content)


def parse(url):
    html = etree.HTML(requests.get(url).content)
    result = html.xpath(xpath_movie)
    fetch_list = [url + p.get('href') for p in html.xpath(xpath_pages)]

    threads = []
    for url in fetch_list:
        t = Thread(target=fetch_content, args=[url])
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

    for page in pages_content:
        result.extend(etree.HTML(page).xpath(xpath_movie))
    for i, movie in enumerate(result, 1):
        title = movie.find(xpath_title).text
The same crawl with a pool of processes via concurrent.futures:

from concurrent.futures import ProcessPoolExecutor
import requests
from lxml import etree

url = 'https://movie.douban.com/top250'
xpath_movie = '//*[@id="content"]/div/div[1]/ol/li'
xpath_title = './/span[@class="title"]'
xpath_pages = '//*[@id="content"]/div/div[1]/div[2]/a'


def fetch_content(url):
    return requests.get(url).content


def parse(url):
    html = etree.HTML(fetch_content(url))
    result = html.xpath(xpath_movie)
    fetch_list = [url + p.get('href') for p in html.xpath(xpath_pages)]

    # four worker processes download the remaining pages in parallel
    with ProcessPoolExecutor(max_workers=4) as executor:
        for page in executor.map(fetch_content, fetch_list):
            result.extend(etree.HTML(page).xpath(xpath_movie))

    for i, movie in enumerate(result, 1):
        title = movie.find(xpath_title).text


if __name__ == '__main__':
    parse(url)
Here the real advantage of multiple processes (CPU-bound work) never comes into play; instead, the cost of creating and scheduling processes far outweighs their positive effect and drags performance down. Even so, the multiprocess version still performs much better than the earlier single-process, single-thread model.
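The overhead is easy to observe by pushing the same I/O-bound job through a thread pool and a process pool; a sketch (the URL list is just a stand-in):

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from time import time
import requests

urls = ['https://movie.douban.com/top250'] * 8

def fetch(url):
    return len(requests.get(url).content)

if __name__ == '__main__':
    for pool_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        start = time()
        with pool_cls(max_workers=4) as pool:
            list(pool.map(fetch, urls))
        # for I/O-bound work the thread pool usually wins: process
        # startup and inter-process pickling add cost, not parallelism
        print(pool_cls.__name__, time() - start)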
gevent takes the coroutine route instead: monkey-patching makes the blocking socket calls cooperative, and one greenlet is spawned per page:

import gevent
from gevent import monkey
monkey.patch_all()  # make blocking I/O yield to other greenlets

import requests
from lxml import etree

url = 'https://movie.douban.com/top250'
xpath_movie = '//*[@id="content"]/div/div[1]/ol/li'
xpath_title = './/span[@class="title"]'
xpath_pages = '//*[@id="content"]/div/div[1]/div[2]/a'


def fetch_content(url):
    return requests.get(url).content


def parse(url):
    html = etree.HTML(fetch_content(url))
    result = html.xpath(xpath_movie)
    fetch_list = [url + p.get('href') for p in html.xpath(xpath_pages)]

    jobs = [gevent.spawn(fetch_content, url) for url in fetch_list]
    gevent.joinall(jobs)  # wait until every greenlet has finished

    for page in [job.value for job in jobs]:
        result.extend(etree.HTML(page).xpath(xpath_movie))
    for i, movie in enumerate(result, 1):
        title = movie.find(xpath_title).text
Python needed a standard library of its own to support coroutines, and that is how asyncio later came about.
Replacing the synchronous requests library with aiohttp, which supports asyncio, here is a coroutine version written with the async/await syntax introduced in Python 3.5:
import asyncio
import aiohttp
from lxml import etree
from time import time

url = 'https://movie.douban.com/top250'
xpath_movie = '//*[@id="content"]/div/div[1]/ol/li'
xpath_title = './/span[@class="title"]'
xpath_pages = '//*[@id="content"]/div/div[1]/div[2]/a'


async def fetch_content(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.read()


async def parse(url):
    page = await fetch_content(url)
    html = etree.HTML(page)
    result = html.xpath(xpath_movie)
    fetch_list = [url + p.get('href') for p in html.xpath(xpath_pages)]

    # schedule every remaining download at once, then wait for all together
    tasks = [fetch_content(url) for url in fetch_list]
    pages = await asyncio.gather(*tasks)

    for page in pages:
        result.extend(etree.HTML(page).xpath(xpath_movie))
    for i, movie in enumerate(result, 1):
        title = movie.find(xpath_title).text


def main():
    loop = asyncio.get_event_loop()
    start = time()
    for _ in range(5):  # average over five runs
        loop.run_until_complete(parse(url))
    end = time()
    print('Cost {} seconds'.format((end - start) / 5))


if __name__ == '__main__':
    main()
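On Python 3.7 and newer, the explicit event-loop management can be dropped in favor of asyncio.run; a minimal sketch:

import asyncio

# asyncio.run creates a fresh event loop, runs the coroutine to
# completion, and closes the loop (replaces get_event_loop/run_until_complete)
asyncio.run(parse(url))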