
Python Web Scraping for Beginners: Fetching Web Page Data

Fetching web data

Working with the urllib module

How to fetch web page data with Python

Encoding conversion

Prepare the web page material

Start httpd

The Apache access log reveals that the requests were made by Python

Fix: add a header to the urllib request

```python
import urllib.request as u

request = u.Request("http://192.168.86.11")  # wrap the target URL in a Request object
request.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; "
                   "rv:73.0) Gecko/20100101 Firefox/73.0")  # attach a browser User-Agent
response = u.urlopen(request)  # open the Request (URL plus headers)
html = response.read()
print(html)  # show the page source
```
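Before sending anything over the network, you can confirm the header actually got attached by inspecting the Request object. One detail worth knowing: urllib normalizes header names with str.capitalize, so the stored key is "User-agent". A quick check (not part of the original write-up):

```python
import urllib.request as u

request = u.Request("http://192.168.86.11")
request.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; "
                   "rv:73.0) Gecko/20100101 Firefox/73.0")
# urllib stores header names capitalized, so query with "User-agent"
print(request.get_header("User-agent"))
```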

Check the Apache log on the Linux host

vim /var/log/httpd/access_log — check whether the entries still identify the client as Python
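Why the log gave Python away in the first place: unless overridden, urllib announces its own User-Agent on every request, and that is exactly the string Apache records. A sketch that shows the default without touching the network:

```python
import urllib.request as u

# A fresh OpenerDirector carries urllib's default User-Agent header,
# which is what ends up in Apache's access_log.
opener = u.build_opener()
print(opener.addheaders)  # e.g. [('User-agent', 'Python-urllib/3.11')]
```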

A program that downloads an image

```python
import urllib.request as u

request = u.Request("http://192.168.86.11/style/"
                    "u24020836931378817798fm170s6BA8218A7B2128178FA0A49F010080E2w.jpg")  # image URL
request.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; "
                   "rv:73.0) Gecko/20100101 Firefox/73.0")
response = u.urlopen(request)
html = response.read()  # the image as binary data
# print(html)
with open("c:\\users\\allen\\desktop\\爬虫.jpg", "wb") as f:
    f.write(html)
```
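A side note on response.read(): it loads the whole file into memory at once, which is fine for one image but wasteful for large downloads. The object urlopen() returns is file-like, so it can be streamed to disk in chunks instead. In this sketch io.BytesIO stands in for the response so the code runs without the test web server:

```python
import io
import shutil

# BytesIO stands in for the file-like object returned by urlopen()
fake_response = io.BytesIO(b"\xff\xd8\xff" + b"\x00" * 1024)  # JPEG-like bytes
with open("demo.jpg", "wb") as f:
    shutil.copyfileobj(fake_response, f)  # copy in chunks rather than all at once
```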

Next, restructure the page-fetching program into functions

```python
import urllib.request as u

url = "http://192.168.86.11"

def get_html(urladdr):
    "Fetch the full source of the page at urladdr"
    request = u.Request(urladdr)
    request.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; "
                       "rv:73.0) Gecko/20100101 Firefox/73.0")
    response = u.urlopen(request)
    html = response.read()
    return html

def get_imglist():
    "Collect every image address into one big list"
    pass

def get_imgs():
    "Download and save every image in the list"
    pass

html = get_html(url)
print(html)
```

Matching strings with regular expressions

Matching a single character

"." matches any single character except a newline

```python
>>> import re
>>> re.findall(".ood","I say Good not food")
['Good', 'food']
>>> re.findall(".ood","I say Good not food @ood")
['Good', 'food', '@ood']
>>> re.findall(".ood","I say Good not food ood")
['Good', 'food', ' ood']
>>> re.findall(".ood","I say Good not food \nood")
['Good', 'food']
>>>
```
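The last call above shows that "." refuses to cross a newline. If you ever do need it to, the re.DOTALL flag changes that behavior (a supplementary note beyond the original examples):

```python
import re

s = "I say Good not food \nood"
print(re.findall(".ood", s))             # ['Good', 'food'] -- "." skips "\n"
print(re.findall(".ood", s, re.DOTALL))  # ['Good', 'food', '\nood']
```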

[ ] matches a single character drawn from the set inside the brackets

```python
>>> re.findall("[fn]ood","I say Good not food nood")    # "ood" preceded by f or n
['food', 'nood']
>>> re.findall("[^fn]ood","I say Good not food nood")   # negated: "ood" preceded by anything but f or n
['Good']
>>> re.findall("^[Gfn]ood","Good not food nood")        # anchored: the string must begin with G, f, or n plus "ood"
['Good']
>>> re.findall("^[Gfn]ood","I say Good not food nood")
[]
>>>
```

\d matches a single digit (0-9)

```python
>>> re.findall("\d","How old are you? I am 36")
['3', '6']
>>> re.findall("\d\d","How old are you? I am 36")
['36']
>>>
```

\w matches a single character from the range 0-9a-zA-Z_

```python
>>> re.findall("\w","How old are you? I am 36")
['H', 'o', 'w', 'o', 'l', 'd', 'a', 'r', 'e', 'y', 'o', 'u', 'I', 'a', 'm', '3', '6']
>>> re.findall("\w\w\w","How old are you? I am 36")
['How', 'old', 'are', 'you']
>>> re.findall("\w\w","How old are you? I_am 36")
['Ho', 'ol', 'ar', 'yo', 'I_', 'am', '36']
>>>
```

\s matches whitespace characters (space, tab, carriage return, newline)

```python
>>> re.findall("\s","\tHow old are you?\r\n")
['\t', ' ', ' ', ' ', '\r', '\n']
>>>
```

Matching a group of characters

Literal matching

```python
>>> re.findall("allen","I am allen")
['allen']
>>> re.findall("allen","I am allenallen")
['allen', 'allen']
>>>
```

Literal matching with | separating alternative strings

```python
>>> re.findall("food|nood","I say Good not food nood")
['food', 'nood']
>>> re.findall("not|nood","I say Good not food nood")
['not', 'nood']
>>>
```

* means the character immediately to its left appears 0 or more times

```python
>>> re.findall("go*gle","I like google not ggle goooogle and gogle")
['google', 'ggle', 'goooogle', 'gogle']
>>>
```

+ means the character immediately to its left appears 1 or more times

```python
>>> re.findall("go+gle","I like google not ggle goooogle and gogle")
['google', 'goooogle', 'gogle']
>>>
```

? means the character immediately to its left appears 0 or 1 times

```python
>>> re.findall("go?gle","I like google not ggle goooogle and gogle")
['ggle', 'gogle']
```

{} specifies exactly how many times the character to its left appears

```python
>>> re.findall("go{2}gle","I like google not ggle goooogle and gogle")
['google']
>>> re.findall("go{1}gle","I like google not ggle goooogle and gogle")
['gogle']
>>> re.findall("go{1,4}gle","I like google not ggle goooogle and gogle")
['google', 'goooogle', 'gogle']
>>>
```
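All of these quantifiers are greedy: combined with ".", they swallow as much text as possible, which matters once you start extracting URLs from page source. Appending ? to a quantifier makes it non-greedy (an aside beyond the original examples):

```python
import re

s = "I like google not ggle goooogle and gogle"
print(re.findall("g.*gle", s))   # greedy: one match spanning the whole tail
print(re.findall("g.*?gle", s))  # non-greedy: each name matched separately
```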

Putting it all together: fetching the images from the test web page

```python
import urllib.request as u
import re

url = "http://192.168.86.11/"  # note the trailing slash

def get_html(urladdr):
    "Fetch the full source of the page at urladdr"
    request = u.Request(urladdr)
    request.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; "
                       "rv:73.0) Gecko/20100101 Firefox/73.0")
    response = u.urlopen(request)
    html = response.read()
    return html

def get_imglist(url, html):
    "Collect every image address into one big list"
    imglist = []  # container for the image URLs
    bytsimglist = re.findall(rb"style/\w{60}\.jpg", html)
    for i in bytsimglist:  # the addresses are relative and are bytes, so build full URLs
        imgaddr = url + str(i, encoding='utf8')  # join and decode to str
        imglist.append(imgaddr)
    return imglist

def get_imgs(imglist):
    "Download and save every image in the list"
    num = 0  # counter used to name the files
    for imgurl in imglist:
        num += 1
        data = get_html(imgurl)
        with open("%s.jpg" % num, "wb") as f:  # files are named 1.jpg, 2.jpg, ...
            f.write(data)

html = get_html(url)
# print(html)
imglist = get_imglist(url, html)
# print(len(imglist))
get_imgs(imglist)
```
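The extraction step can be exercised without the web server by feeding get_imglist() a handmade fragment of page source. The file name below is made up purely to fit the 60-character pattern:

```python
import re

def get_imglist(url, html):
    "Collect every image address into one big list"
    imglist = []
    for i in re.findall(rb"style/\w{60}\.jpg", html):
        imglist.append(url + str(i, encoding='utf8'))
    return imglist

name = b"u" + b"0" * 59                        # 60 word characters, matching \w{60}
html = b'<img src="style/' + name + b'.jpg">'  # hypothetical page fragment
print(get_imglist("http://192.168.86.11/", html))
```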

Scraping resources from the Buka comic site

```python
import urllib.request as u
import re

# url = "http://www.buka.cn/view/223172/65537.html"
# url = "http://www.buka.cn/view/223578/65537.html"
# url = "http://www.buka.cn/view/221784/65540.html"
url = "http://www.buka.cn/view/219792/65742.html"

def get_html(urladdr):
    "Fetch the full source of the page at urladdr"
    request = u.Request(urladdr)
    request.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; "
                       "rv:73.0) Gecko/20100101 Firefox/73.0")
    response = u.urlopen(request)
    html = response.read()
    return html

def get_imglist(url, html):
    "Collect every image address into one big list"
    imglist = []  # container for the image URLs
    bytsimglist = re.findall(rb"http://i-cdn.ibuka.cn/pics/\d+/\d+/\w+\.jpg", html)
    # print(bytsimglist)
    for i in bytsimglist:
        imglist.append(str(i, encoding='utf8'))
    return imglist

def get_imgs(imglist):
    "Download and save every image in the list"
    num = 0  # counter used to name the files
    for imgurl in imglist:
        num += 1
        data = get_html(imgurl)
        with open("%s.jpg" % num, "wb") as f:  # files are named 1.jpg, 2.jpg, ...
            f.write(data)

html = get_html(url)
# print(html)
imglist = get_imglist(url, html)
# print(imglist)
get_imgs(imglist)
```

Special symbols in regular expressions

^ matches at the start of the string; $ matches at the end

```python
>>> re.findall('^I say',"I say Good not food")
['I say']
>>> re.findall('not food$',"I say Good not food")
['not food']
>>> re.findall('not Good$',"I say Good not food")
[]
>>>
```

\b marks a word boundary; note that _ counts as a word character, not a boundary

```python
>>> re.findall("allen","allen.com allen_123 allen.com")
['allen', 'allen', 'allen']
>>> re.findall("\ballen\b","allen.com allen_123 allen.com")
[]
>>> re.findall("\\ballen\\b","allen.com allen_123 allen.com")
['allen', 'allen']
>>>
```
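The empty result from the second call deserves a closer look: in an ordinary string, "\b" is the backspace character, so that pattern can never match. Doubling the backslash works, but the idiomatic fix is a raw string:

```python
import re

s = "allen.com allen_123 allen.com"
print(re.findall("\ballen\b", s))   # [] -- "\b" here is a literal backspace
print(re.findall(r"\ballen\b", s))  # raw string keeps the backslash for re
```

allen_123 is skipped in both cases because _ is a word character, so there is no boundary after "allen".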