
Python Web Scraping for Beginners (2) | Starting from Scratch

Author: Gausst松鼠会 | 2024-03-20 08:31:37


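3. CAPTCHAs

3.1 Advanced usage of requests

Anti-scraping mechanism: CAPTCHAs. To simulate a login, the scraper has to recognize the content of the CAPTCHA image.
1. Manual recognition by eye.
2. Automatic recognition through a third-party service. (Recommended)
  1. 云打码 (Yundama; the site could not be found)
    1. Log in
    2. Log in as a regular user and check that the account has credits
  2. 超级鹰 (Chaojiying)
Anti-hotlinking: the server traces the origin of the current request, i.e. which page it came from (the Referer header).
HTTP and HTTPS are stateless.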

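Why the request did not return the expected page data:

When the second request, the one for the profile page, is sent, the server does not know that it comes from a logged-in client.

Cookie: lets the server record state about the client.

Manual handling: capture the cookie with a packet-capture tool and put its value into headers. (Not recommended)

Automatic handling:

Where does the cookie come from?
- The server creates it in response to the simulated login POST request.
Session object
- What it does
  1. It can send requests.
  2. If a cookie is produced during a request, the cookie is automatically stored in the session object and sent with later requests.
Steps:
1. Create a session object.
2. Use the session object to send the simulated login POST request (the cookie is stored in the session object automatically).
3. Use the same session object to send the GET request for the profile page (the cookie is sent along).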
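
The session steps above can be sketched with requests.Session. This is a minimal sketch, not the author's code: login_url, profile_url, and form_data are hypothetical placeholders for whatever the target site expects, and cookie_header_for is a helper added here only to show which Cookie header the session would attach, without touching the network.

```python
import requests


def login_and_fetch(login_url, profile_url, form_data, headers=None):
    """POST the login form, then GET the profile page with the same session.

    All three arguments are placeholders (assumptions); real sites differ in
    their login URL and form field names.
    """
    with requests.Session() as session:
        # Cookies set by the login response are stored in session.cookies.
        session.post(login_url, data=form_data, headers=headers, timeout=10)
        # The same session sends those cookies along automatically.
        return session.get(profile_url, headers=headers, timeout=10).text


def cookie_header_for(session, url):
    """Return the Cookie header the session would attach to a GET for url."""
    prepared = session.prepare_request(requests.Request("GET", url))
    return prepared.headers.get("Cookie")
```

A fresh session has no cookies, so cookie_header_for returns None for it; once a cookie for the target domain is stored in session.cookies, the helper returns it as a name=value string.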

import requests

if __name__ == '__main__':
    # Download one pearvideo.com video; the Referer header gets past the anti-hotlinking check.
    url = "https://www.pearvideo.com/video_1630895"
    video_id = url.split("_")[1]
    video_url = f"https://www.pearvideo.com/videoStatus.jsp?contId={video_id}&mrd=0.8090056640136296"
    headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                             "(KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36 Edg/107.0.1418.56",
               "Referer": f"https://www.pearvideo.com/video_{video_id}"  # anti-hotlinking
               }
    page = requests.get(url=video_url, headers=headers)
    page_dic = page.json()
    src_url = page_dic['videoInfo']['videos']['srcUrl']
    system_time = page_dic['systemTime']
    # Swap the systemTime segment of srcUrl for "cont-<video id>" to get the downloadable address.
    src_url = src_url.replace(system_time, f"cont-{video_id}")
    video = requests.get(url=src_url, headers=headers).content
    with open("a.mp4", mode='wb') as fp:
        fp.write(video)
    print("over!!")

3.2 Proxies

Proxies get around anti-scraping mechanisms that ban IP addresses.

What is a proxy:

- A proxy server.

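What a proxy does:

It gets past access limits placed on your own IP.
It can hide your real IP.

Proxy-related sites (requests takes the proxy as proxies = {"http": "http://ip:port"}):

快代理 (Kuaidaili)
西祠代理 (Xici)
www.goubanjia.com

Types of proxy IP:

http
https

Anonymity levels of proxy IPs:

Transparent: the server knows the request went through a proxy and also knows the real IP behind it.
Anonymous: the server knows a proxy was used but does not know the real IP.
High anonymity: the server does not know a proxy was used, let alone the real IP.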
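
As a rough illustration of the proxies argument mentioned above, the sketch below builds the dictionary that requests expects and passes it to requests.get. The helper names build_proxies and fetch_via_proxy are made up for this example, and the host and port must come from a proxy you actually have access to; none is assumed to exist here.

```python
import requests


def build_proxies(host, port):
    """Map both URL schemes to one HTTP proxy, in the format requests expects.

    The dictionary keys are the scheme of the URL being requested, the values
    are the proxy address to route that scheme through.
    """
    proxy_url = f"http://{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


def fetch_via_proxy(url, host, port, timeout=10):
    """Fetch url through the given proxy and return the response body as text."""
    headers = {"User-Agent": "Mozilla/5.0"}
    resp = requests.get(url, headers=headers,
                        proxies=build_proxies(host, port), timeout=timeout)
    resp.raise_for_status()  # fail loudly if the proxy or site rejects the request
    return resp.text
```

Setting a timeout matters more than usual with free proxies, since many of them are slow or dead.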

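4. Asynchronous scraping

Goal: use asynchronous execution in the scraper to fetch data with high performance.

4.1 Ways to scrape asynchronously

Multithreading, multiprocessing (not recommended)

Pros: each blocking operation can get its own thread or process, so the blocking operations run asynchronously.

Cons: threads or processes cannot be started without limit.

Process pools, thread pools: from multiprocessing.dummy import Pool (use in moderation)

Pros: the system creates and destroys processes or threads less often, which lowers overhead considerably.

Cons: the number of threads or processes in a pool has an upper limit.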
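
Before the real download case, here is a small offline sketch of the pool idea. fake_download is a stand-in invented for this example: it sleeps instead of making a request, so the snippet runs without network access.

```python
import time
from multiprocessing.dummy import Pool  # thread-based pool with the multiprocessing.Pool API


def fake_download(name):
    """Stand-in for a blocking request; sleeps instead of touching the network."""
    time.sleep(0.2)
    return f"{name} done"


def run_with_pool(names, workers=4):
    """Run fake_download for every name on a pool of worker threads."""
    # The with-block shuts the worker threads down once map() has returned.
    with Pool(workers) as pool:
        # map() blocks until all calls finish and keeps the input order.
        return pool.map(fake_download, names)
```

With four workers, four 0.2-second tasks take roughly 0.2 seconds in total rather than 0.8, because the sleeps overlap.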

- Example: downloading videos with a thread pool

import requests
from lxml import etree
from multiprocessing.dummy import Pool  # thread-based pool

headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                         "(KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36 Edg/107.0.1418.56"
           }


def download(dic):
    print(dic['name'] + ' is downloading...')
    video = requests.get(url=dic['url'], headers=dic['headers']).content
    with open(dic['name'], "wb") as fp:
        fp.write(video)
    print(dic['name'] + ' finished downloading.')


if __name__ == '__main__':
    url = "https://www.pearvideo.com/category_1"
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//*[@id="listvideoListUl"]/li')
    video_list = []
    for li in li_list:
        vi_url = "https://www.pearvideo.com/" + li.xpath('./div/a/@href')[0]
        video_name = li.xpath('./div/a/div[2]/text()')[0] + '.mp4'
        video_id = vi_url.split("_")[1]
        # Each video needs its own Referer (anti-hotlinking), so build the headers per video.
        video_headers = {**headers, "Referer": f'https://www.pearvideo.com/video_{video_id}'}
        video_url = f'https://www.pearvideo.com/videoStatus.jsp?contId={video_id}&mrd=0.8869584158606576'
        page_dic = requests.get(url=video_url, headers=video_headers).json()
        src_url = page_dic['videoInfo']['videos']['srcUrl']
        system_time = page_dic['systemTime']
        src_url = src_url.replace(system_time, f"cont-{video_id}")
        video_list.append({"name": video_name, "url": src_url, "headers": video_headers})
    pool = Pool(4)  # thread pool with 4 worker threads
    pool.map(download, video_list)  # call download() once for each entry
    pool.close()
    pool.join()

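Single thread + asynchronous coroutines (recommended)
- event_loop: the event loop, essentially an infinite loop. Functions can be registered on it, and the loop runs them when their conditions are met.
- coroutine: a coroutine object. It can be registered on the event loop, which then calls it. A function defined with the async keyword is not executed immediately when called; the call returns a coroutine object instead.
- task: a further wrapper around a coroutine object that also tracks the task's state.
- future: represents a task that will run in the future but has not run yet; in essence it is no different from a task.
- async: defines a coroutine.
- await: suspends execution at a blocking call.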
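
The terms above map onto the standard asyncio module roughly as follows. This is an illustrative sketch: fake_request and its delay are invented for the example, and asyncio.sleep stands in for a real non-blocking network call.

```python
import asyncio


async def fake_request(name, delay):
    """Defined with async: calling it returns a coroutine object, nothing runs yet."""
    await asyncio.sleep(delay)  # await suspends here and hands control back to the event loop
    return f"{name} finished"


async def main():
    # create_task wraps each coroutine object in a Task and schedules it on the running loop.
    tasks = [asyncio.create_task(fake_request(f"req{i}", 0.2)) for i in range(3)]
    # gather waits for all tasks and returns their results in submission order.
    return await asyncio.gather(*tasks)


def run():
    # asyncio.run creates the event loop, runs main() to completion, and closes the loop.
    return asyncio.run(main())
```

The three simulated requests wait concurrently on one thread, so the whole run takes about as long as a single request. A blocking call such as time.sleep or requests.get inside a coroutine would stall the loop and lose that benefit.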

4.2 aiohttp

A module for asynchronous network requests.
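
A minimal sketch of how aiohttp is typically combined with asyncio, assuming aiohttp has been installed separately (pip install aiohttp). The function names fetch and fetch_all are chosen for this example and the URLs are supplied by the caller.

```python
import asyncio

import aiohttp


async def fetch(session, url):
    """GET one URL with the shared session and return the body as text."""
    async with session.get(url) as resp:
        return await resp.text()


async def fetch_all(urls):
    """Fetch all URLs concurrently; results come back in the order of urls."""
    # One ClientSession is shared by all requests and closed when the block exits.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))
```

A script would call it as asyncio.run(fetch_all(list_of_urls)). aiohttp is used here instead of requests because requests.get blocks the thread, which would stop the event loop from switching to other coroutines.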

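5. selenium

1. Introduction:

Makes it easy to get data that a site loads dynamically
Makes it easy to simulate a login

A module built on browser automation.

2. Usage:

Install the environment: pip install selenium
Download the driver for your browser: http://npm.taobao.org/mirrors/chromedriver/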

from selenium import webdriver

bro = webdriver.Chrome()
bro.get("https://www.baidu.com/")
# The script would otherwise end right away and close the browser; waiting for input keeps it open
input()
bro.quit()

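Other usage
- Send a request: get(url)
- Locate elements: the find family of methods; newer versions changed the exact usage.
```
driver.find_element(By.XPATH, "")
```
- Interact with an element: send_keys('xxx')
- Run JavaScript: execute_script("jsCode")
- Back and forward: back(), forward()
- Close the browser: quit()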

Open Kugou, type a search term, and submit it

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Create the browser object
options = Options()
# A visible browser window is the default; for headless mode add: options.add_argument("-headless")
driver = webdriver.Firefox(options=options)
# driver = webdriver.Firefox()  # without the options argument a normal browser window opens

driver.get('https://www.kugou.com/')
# print(driver.title)
# print(driver.current_url)
# Locate the search box once, then type the query and press Enter
search_box = driver.find_element(By.XPATH, '/html/body/div[1]/div[1]/div/div[1]/div/input')
search_box.send_keys('李白')
search_box.send_keys(Keys.ENTER)

Handling iframes with selenium

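6. The scrapy framework

What is scrapy

A star framework among scraping tools. Features: high-performance persistent storage, asynchronous data download, high-performance data parsing, distributed crawling.

Basic usage of scrapy
- mac or linux: pip install scrapy
- windows:
  - pip install wheel
  - Download twisted: https://www.lfd.uci.edu/~gohlke/pythonlibs//#twisted
  - Install twisted: pip install Twisted-17.1.0-cp36 (pick the file that matches your Python version; the install must finish without errors)
  - pip install pywin32
  - pip install scrapy
  Test: type scrapy in a terminal; no error means the installation succeeded.
- Create a project: scrapy startproject XXPro
- cd XXPro
- Create a spider file in the spiders subdirectory
  - scrapy genspider spiderName www.xxx.com
- Run the project
  - scrapy crawl spiderName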

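import scrapy


class FirstSpider(scrapy.Spider):
    # Spider name: the unique identifier of this spider source file
    name = 'first'
    # Allowed domains: limits which URLs in start_urls may actually be requested
    # Usually left unused
    # allowed_domains = ['www.xxx.com']
    # Start URL list: scrapy automatically sends requests to the URLs stored here
    start_urls = ['http://www.baidu.com/']

    # Used for data parsing: response is the response data of a successful request
    def parse(self, response):
        pass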


Parsing data with scrapy

import scrapy

class FirstSpider(scrapy.Spider):
    # Spider name: the unique identifier of this spider source file
    name = 'first'
    # Allowed domains: limits which URLs in start_urls may actually be requested
    # Usually left unused
    # allowed_domains = ['www.xxx.com']
    # Start URL list: scrapy automatically sends requests to the URLs stored here
    start_urls = ['https://movie.douban.com/top250?start=0&filter=']

    # Used for data parsing: response is the response data of a successful request
    def parse(self, response):
        # Parse:
        li_list = response.xpath('//*[@id="content"]/div/div[1]/ol/li')
        for li in li_list:
            # xpath returns a list whose elements are always Selector objects
            # extract pulls out the string stored in a Selector object's data attribute
            # calling extract on the list extracts the string from every Selector in it
            title = li.xpath("./div/div[2]/div/a/span[1]/text()").extract_first()
            print(title)

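Persistent storage

Via a terminal command:
- Requirement: only the return value of the parse method can be stored, and only to a local text file.
- Note: the file type must be one of: 'json', 'jsonlines', 'jsonl', 'jl', 'csv', 'xml', 'marshal', 'pickle'
- Command: scrapy crawl xxx -o filepath
- Pros: concise, efficient, convenient
- Cons: fairly restrictive (data can only be stored in files with the listed extensions).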

import scrapy

class FirstSpider(scrapy.Spider):
    # Spider name: the unique identifier of this spider source file
    name = 'first'
    # Allowed domains: limits which URLs in start_urls may actually be requested
    # Usually left unused
    # allowed_domains = ['www.xxx.com']
    # Start URL list: scrapy automatically sends requests to the URLs stored here
    start_urls = ['https://movie.douban.com/top250?start=0&filter=']

    # Used for data parsing: response is the response data of a successful request
    def parse(self, response):
        # Parse:
        li_list = response.xpath('//*[@id="content"]/div/div[1]/ol/li')

        all_data = []
        for li in li_list:
            # xpath returns a list whose elements are always Selector objects
            # extract pulls out the string stored in a Selector object's data attribute
            # calling extract on the list extracts the string from every Selector in it
            title = li.xpath("./div/div[2]/div/a/span[1]/text()").extract_first()
            # print(title)
            dic = {
                'title': title
            }  # each record must be a dict
            all_data.append(dic)
        # print(all_data)
        return all_data


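Via pipelines:

Coding flow:
1. Define the relevant fields in the item class.
2. Wrap the parsed data in an item object.
3. Submit the item object to the pipeline for persistent storage.
4. In the pipeline class's process_item method, persist the data of the item object it receives.
5. Enable the pipeline in the settings file.
Pros: highly general.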

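Each pipeline class in the pipelines file stores the data to one platform.

An item submitted by the spider file is received only by the pipeline class that runs first.

return item in process_item passes the item on to the next pipeline class to run.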
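
To illustrate these three points, a second pipeline class could store the same items on another platform. The sketch below uses SQLite from the standard library; the class name SqlitePipeline, the file top.db, the movies table, and the firstProject module path are all assumptions made for this example.

```python
import sqlite3


class SqlitePipeline:
    """Second pipeline: stores each title in a SQLite database."""

    db_path = "./top.db"  # hypothetical location of the database file

    def open_spider(self, spider):
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS movies (title TEXT)")

    def process_item(self, item, spider):
        # The item arrives here only because the previous pipeline returned it.
        self.conn.execute("INSERT INTO movies (title) VALUES (?)", (item["title"],))
        self.conn.commit()
        return item  # keep passing it on in case another pipeline follows

    def close_spider(self, spider):
        self.conn.close()


# settings.py: both classes enabled; the lower number (300) runs first.
ITEM_PIPELINES = {
    "firstProject.pipelines.FirstprojectPipeline": 300,
    "firstProject.pipelines.SqlitePipeline": 301,
}
```

If the first pipeline's process_item forgot to return the item, SqlitePipeline would receive None and fail on item["title"].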


items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class FirstprojectItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()



pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


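class FirstprojectPipeline:
    fp = None

    # Hook method: called only once, when the spider starts
    def open_spider(self, spider):
        print("Spider starting...")
        self.fp = open('./top.txt', 'w', encoding='utf-8')

    # Handles item objects:
    # receives each item submitted by the spider file
    # and is called once for every item received
    def process_item(self, item, spider):
        title = item['title']
        self.fp.write(title + '\n')
        return item  # pass the item on to the next pipeline class

    # Hook method: called only once, when the spider finishes
    def close_spider(self, spider):
        print('Spider finished')
        self.fp.close()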
