一、The Web Crawling Process
Step 1: Analyze the requirements.
Step 2: Based on the requirements, select the target web page (specify the URL).
Step 3: Fetch the site's data to the local machine.
Step 4: Locate and extract the target data.
Step 5: Store the data (e.g. in MySQL or Redis; a storage sketch follows this list).
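Step 5 is not covered by the code in the next section, so here is a minimal sketch of what storage might look like, assuming a local MySQL instance; the pymysql package, the connection parameters, and the table `words(kw, result)` are all hypothetical choices for illustration, not part of the original tutorial.

import pymysql

def save_result(kw, result):
    # Hypothetical connection settings; adjust to your own MySQL setup
    conn = pymysql.connect(host='localhost', user='root',
                           password='', database='spider', charset='utf8mb4')
    try:
        with conn.cursor() as cur:
            # Parameterized query avoids SQL injection
            cur.execute('INSERT INTO words (kw, result) VALUES (%s, %s)',
                        (kw, result))
        conn.commit()
    finally:
        conn.close()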
二、Code Implementation
- Step 1: Pass in the url
- Step 2: Build the user_agent string
- Step 3: Assemble the headers
- Step 4: Create the Request object
- Step 5: Call urlopen
- Step 6: Return the byte string (see the minimal sketch below)
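Before the full wrapper, here is a minimal sketch of these six steps for a plain GET request; the example URL and the shortened user-agent string are arbitrary choices for illustration.

from urllib import request

url = 'http://www.baidu.com'                          # step 1: url
user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64)'    # step 2: user_agent
headers = {'User-Agent': user_agent}                  # step 3: headers
req = request.Request(url, headers=headers)           # step 4: Request
response = request.urlopen(req)                       # step 5: urlopen
html_bytes = response.read()                          # step 6: bytes returned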
1. Import the packages
# Imports
from urllib import request, parse
from urllib.error import HTTPError, URLError
2. Define the GET and POST wrapper functions
# GET request wrapper
def get(url, headers=None):
    return urlrequests(url, headers=headers)

# POST request wrapper
def post(url, form, headers=None):
    return urlrequests(url, form, headers=headers)
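Both wrappers delegate to urlrequests, defined in the next subsection (within one module, definition order does not matter as long as the calls happen after all definitions are loaded). A quick usage sketch, with arbitrary example values:

html = get('http://www.baidu.com')
html = get('http://www.baidu.com', headers={'User-Agent': 'my-spider/0.1'})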
3. Wrap the crawler in a single function
# Crawler wrapper function
def urlrequests(url, form=None, headers=None):
    # Pretend to be a browser
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'

    if headers is None:
        headers = {'User-Agent': user_agent}

    html_bytes = b''
    try:
        if form:
            # POST request
            # (1) Encode the form dict as a query string (str)
            form_str = parse.urlencode(form)
            # (2) Convert the str to bytes
            form_bytes = form_str.encode('utf-8')
            # Passing data= makes this a POST request
            req = request.Request(url, data=form_bytes, headers=headers)
        else:
            # GET request
            req = request.Request(url, headers=headers)
        # Fetch the data from the site
        response = request.urlopen(req)
        html_bytes = response.read()
        # Write the response to a file
        with open('fanyi.html', 'wb') as f:
            f.write(html_bytes)
    except HTTPError as e:
        print(e)
    except URLError as e:
        print(e)

    return html_bytes
if __name__ == '__main__':
    url = 'http://fanyi.baidu.com/sug/'
    form = {'kw': '汽车'}   # '汽车' means 'car'
    html_bytes = post(url, form)
    print(html_bytes)

4. Output
b'{"errno":0,"data":[{"k":"\\u6c7d\\u8f66","v":"[q\\u00ec ch\\u0113] car; automobile; auto; motor vehicle; aut"},{"k":"\\u6c7d\\u8f66\\u7ad9","v":"bus station;"},{"k":"\\u6c7d\\u8f66\\u5c3e\\u6c14","v":"\\u540d automobile exhaust; vehicle exhaust;"},{"k":"\\u6c7d\\u8f66\\u4eba","v":"\\u540d Autobots;"},{"k":"\\u6c7d\\u8f66\\u914d\\u4ef6","v":"auto parts;"}]}'