
Python Web Crawler (1): Scraping Static Web Pages

How to scrape data from static web pages with Python.

Getting the response content:

import requests

r = requests.get('http://www.santostang.com/')
print(r.encoding)
print(r.status_code)
print(r.text)

This prints the response encoding, the status code (200 = success, 4xx = client error, 5xx = server error), the response body as text, and so on.
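
As a small sketch building on the example above (the raise_for_status call and the utf-8 override are additions, not part of the original snippet), you can have requests raise an exception on error status codes and fix the encoding before reading the text:

import requests

r = requests.get('http://www.santostang.com/')
r.raise_for_status()   # raises requests.exceptions.HTTPError for 4xx/5xx responses
r.encoding = 'utf-8'   # override the guessed encoding if r.text comes out garbled
print(r.text[:200])    # first 200 characters of the decoded body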

Customizing the request

Passing URL parameters (the dictionary passed as params is URL-encoded and appended to the URL as a query string):

key_dict = {'key1': 'value1', 'key2': 'value2'}
r = requests.get('http://httpbin.org/get', params=key_dict)
print(r.url)
print(r.text)

Customizing request headers

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0',
    'Host': 'www.santostang.com'
}
r = requests.get('http://www.santostang.com', headers=headers)
print(r.status_code)
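
Many sites return an error page or block the request outright when the default User-Agent identifies the client as a script, so sending a browser-like User-Agent (and, when needed, the Host header) is often enough to get a normal response.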

Sending a POST request

A POST request sends form data in the request body, so sensitive values such as passwords do not appear in the URL; the data dictionary is automatically encoded as form fields when it is sent.

key_dict = {'key1': 'value1', 'key2': 'value2'}
r = requests.post('http://httpbin.org/post', data=key_dict)
print(r.url)
print(r.text)
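
httpbin.org echoes the submitted data back in a JSON body, so a quick check that the form was sent correctly is to read it back from the response (a small sketch, assuming the request above succeeded):

print(r.json()['form'])  # httpbin echoes the submitted form fields, e.g. {'key1': 'value1', 'key2': 'value2'}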

Setting a timeout (an exception is raised if the server does not respond within the given number of seconds):

r = requests.get('http://www.santostang.com/', timeout=0.11)
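
With a timeout this short the request will usually fail, which is the point of the example: requests raises an exception instead of waiting indefinitely. A minimal sketch of catching it (same URL and timeout as above):

import requests

try:
    r = requests.get('http://www.santostang.com/', timeout=0.11)
    print(r.status_code)
except requests.exceptions.Timeout:
    print('the request timed out')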

Scraping the Douban Top 250 movie list

import requests
from bs4 import BeautifulSoup

def get_movies():
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0',
        'Host': 'movie.douban.com'
    }
    movie_list = []
    for i in range(10):  # the Top 250 list spans 10 pages of 25 movies each
        link = 'https://movie.douban.com/top250'
        key_dict = {'start': i * 25, 'filter': ''}
        r = requests.get(link, params=key_dict, headers=headers)
        # print(r.text)
        print(r.status_code)
        print(r.url)
        soup = BeautifulSoup(r.text, 'lxml')
        div_list = soup.find_all('div', class_='hd')
        for each in div_list:
            # the first <span> inside the title link holds the movie title
            movie = each.a.span.text.strip() + '\n'
            movie_list.append(movie)
    return movie_list

def storFile(data, fileName, method='a'):
    with open(fileName, method, newline='') as f:
        f.write(data)

movie_list = get_movies()
for movie in movie_list:
    storFile(movie, 'movie top250.txt', 'a')
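
Running the script appends each of the 250 titles on its own line to movie top250.txt. Because the file is opened in append mode ('a'), rerunning the script adds the titles again; delete the file first if you want a fresh copy.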
