
Chapter 4: Launching 豆果美食 and Capturing Packets

Crawling the 豆果美食 (Douguo Food) app

4-1 Pre-crawl setup: launch the 豆果美食 app and capture packets

Install 豆果美食 (share the APK to the emulator via QQ) → open Fiddler and remove all sessions → launch 豆果美食 → tap the recipe categories (菜谱分类) → tap 红烧肉 (braised pork) → tap 学做多 (sorted by most-cooked) → inspect the captured packets.
[Screenshots: tapping 菜谱分类, 红烧肉, 学做多 and inspecting the packets]
Problem: the page reports net::ERR_PROXY_CONNECTION_FAILED.
Fix: restart Fiddler.


4-2 Analyzing the 豆果美食 packets captured by Fiddler

Open Fiddler → click Find → enter api.douguo.net → click Find Sessions to highlight the matching sessions → keep the recipe-category packet and the packet with the first 40 土豆 (potato) results under the 学做多 sort.
[Screenshots: highlighted api.douguo.net sessions and packet analysis]


Overall tasks

  1. Analyze the 豆果美食 data packets
  2. Save the data to MongoDB
  3. Fetch the data with a Python thread pool (multithreading)
  4. Hide the crawler behind proxy IPs

4-3 Writing the crawler, part 1: project requirements and request-header spoofing

4-4 Writing the crawler, part 2: ingredient-page parsing and queue logic

Task: analyze the 豆果美食 data packets

[Screenshots: 豆果美食 menu analysis, the menu JSON format, and the menu JSON breakdown]
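
Before the script, a minimal sketch of the nested structure that the parsing loops below assume (the key names "result", "cs", and "name" come from the parsing code; the sample values are illustrative):

# Assumed shape of the /recipe/flatcatalogs response (illustrative values)
flatcatalogs_example = {
    "result": {
        "cs": [                      # top-level categories
            {"cs": [                 # sub-categories
                {"cs": [             # individual dishes / search keywords
                    {"name": "红烧肉"},
                ]},
            ]},
        ],
    },
}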

# Imports
import json
from multiprocessing import Queue
import requests

# Create the task queue
queue_list = Queue()

# Request helper: all three endpoints share the same headers, so only url and data vary
def handle_request(url,data):
    # Headers copied from Fiddler and converted with the editor regex replace (.*?):(.*) → "$1":"$2",  (see the sketch after this script)
    header = {
        "client":"4",
        "version":"7109.2",
        "device":"M2007J22C",
        "sdk":"25,7.1.2",
        "imei":"351564354264020",
        "channel":"qqkp",
        "resolution":"1600*900",
        "dpi":"2.0",
        "brand":"Xiaomi",
        "scale":"2.0",
        "timezone":"28800",
        "language":"zh",
        "cns":"2",
        # "carrier":"CMCC", # 没有这项数据
        "User-Agent":"Mozilla/5.0(Linux;Android 7.1.2;M2007J22C Build/QP1A.190711.020;wv) AppleWebKit/537.36(KHTML,like Gecko) Version/4.0 Chrome/92.0.4515.131 Mobile Safari/537.36",
        "reach":"1",
        "newbie":"1",
        "Content-Type":"application/x-www-form-urlencoded; charset=utf-8",
        "Accept-Encoding":"gzip, deflate",
        "Connection":"Keep-Alive",
        "Host":"logs.douguo.net",
    }
    # Send the request (all three endpoints use POST)
    response = requests.post(url=url,headers=header,data=data)
    return response

# Fetch the category index and enqueue one search payload per dish
def handle_index():
    # Request body copied from Fiddler: replace & with newlines and quote each field
    url = "http://api.douguo.net/recipe/flatcatalogs"
    data = {
        "client":"4",
        "_vs" : "2305",
    }
    response = handle_request(url=url,data=data)
    # Parse the JSON response
    catalog_response_dict = json.loads(response.text)
    for catalog_list in catalog_response_dict['result']['cs']:
        for catalog in catalog_list['cs']:
            for dishes in catalog['cs']:
                data2 = {   
                    "client": "4",
                    "keyword": dishes['name'],
                    "order": "3",
                    "_vs": "400",
                }
                # Enqueue the search payload
                queue_list.put(data2)

handle_index()
print(queue_list.qsize())
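
The header dict above was produced by copying the raw request headers out of Fiddler and applying the regex replace from the comment in a text editor. For illustration, the same transformation in Python:

# Convert raw "key:value" header lines into dict-literal lines
import re

raw_headers = "client:4\nversion:7109.2"  # two sample lines as copied from Fiddler
print(re.sub(r'(.*?):(.*)', r'"\1":"\2",', raw_headers))
# Output:
# "client":"4",
# "version":"7109.2",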

spider_douguo_0.py output: [screenshot]


4-5 Writing the crawler, part 3: fetching the recipe-list data

4-6 Writing the crawler, part 4: fetching the detail-page data

[Screenshots: 豆果美食 recipe analysis, the recipe JSON breakdown, the cooking-step analysis, and the tips/cookstep JSON breakdown]
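
Likewise, a minimal sketch of the detail-response shape that the script reads (only "result", "recipe", "tips", and "cookstep" are taken from the parsing code; the step fields are illustrative):

# Assumed shape of the /recipe/detail/<id> response (illustrative values)
detail_example = {
    "result": {
        "recipe": {
            "tips": "Cooking tips text ...",
            "cookstep": [
                {"content": "Step 1 ..."},  # hypothetical step fields
            ],
        },
    },
}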

# -*- coding: utf-8 -*-
# @Author : 袁天琪
# @Time : 2022/3/14 16:53
# Fetch the recipe lists and details (building on spider_douguo_0.py)

# Imports
import json
from multiprocessing import Queue
import requests

# Create the task queue
queue_list = Queue()

# Request helper: all three endpoints share the same headers, so only url and data vary
def handle_request(url,data):
    # Headers copied from Fiddler (editor regex replace (.*?):(.*) → "$1":"$2",)
    header = {
        "client":"4",
        "version":"7109.2",
        "device":"M2007J22C",
        "sdk":"25,7.1.2",
        "imei":"351564354264020",
        "channel":"qqkp",
        "resolution":"1600*900",
        "dpi":"2.0",
        "brand":"Xiaomi",
        "scale":"2.0",
        "timezone":"28800",
        "language":"zh",
        "cns":"2",
        # "carrier":"CMCC", # 没有这项数据
        "User-Agent":"Mozilla/5.0(Linux;Android 7.1.2;M2007J22C Build/QP1A.190711.020;wv) AppleWebKit/537.36(KHTML,like Gecko) Version/4.0 Chrome/92.0.4515.131 Mobile Safari/537.36",
        "reach":"1",
        "newbie":"1",
        "Content-Type":"application/x-www-form-urlencoded; charset=utf-8",
        "Accept-Encoding":"gzip, deflate",
        "Connection":"Keep-Alive",
        "Host":"logs.douguo.net",
    }
    # Send the request (all three endpoints use POST)
    response = requests.post(url=url,headers=header,data=data)
    return response

# Fetch the category index and enqueue one search payload per dish
def handle_index():
    # Request body copied from Fiddler: replace & with newlines and quote each field
    url = "http://api.douguo.net/recipe/flatcatalogs"
    data = {
        "client":"4",
        "_vs" : "2305",
    }
    response = handle_request(url=url,data=data)
    # Parse the JSON response
    catalog_response_dict = json.loads(response.text)
    for catalog_list in catalog_response_dict['result']['cs']:
        for catalog in catalog_list['cs']:
            for dishes in catalog['cs']:
                data2 = {
                    "client": "4",
                    "keyword": dishes['name'],
                    "order": "3",
                    "_vs": "400",
                }
                # Enqueue the search payload
                queue_list.put(data2)

def handle_recipe_list(data):
    print("当前处理的食材是:",data['keyword'])
    recipe_list_url='http://api.douguo.net/recipe/v2/search/0/20'
    recipe_list_response=handle_request(url=recipe_list_url,data=data)
    recipe_list_response_dict = json.loads(recipe_list_response.text)
    for recipe_list in recipe_list_response_dict['result']['list']:
        recipe_info = {}
        recipe_info['main_ingredient'] = data['keyword']
        if recipe_list['type'] == 13:
            recipe_info['username'] = recipe_list['r']['an']
            recipe_info['ingredients_id'] =recipe_list['r']['id']
            recipe_info['describe'] =recipe_list['r']['cookstory'].replace('\n','').replace(' ','')
            recipe_info['dishname'] =recipe_list['r']['n']
            recipe_info['ingredients'] =recipe_list['r']['major']
            detail_url = 'http://api.douguo.net/recipe/detail/'+str(recipe_info['ingredients_id']) 
            # Build the detail-request payload
            detail_data = {
                "client": "4",
                "author_id":"0",
                "_vs": "2803",
                "_ext": '{"query":{"id":'+str(recipe_info['ingredients_id'])+',"kw":'+recipe_info['main_ingredient']+',"idx":"4";"src":"2803";"type":"13"}}',
            }
            # Fetch the recipe detail
            detail_response = handle_request(url=detail_url,data=detail_data)
            detail_response_dict = json.loads(detail_response.text)
            recipe_info['tips'] = detail_response_dict['result']['recipe']['tips']
            recipe_info['cookstep'] = detail_response_dict['result']['recipe']['cookstep']
            print(json.dumps(recipe_info))
        else:
            continue


handle_index()
handle_recipe_list(queue_list.get())
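
Hand-concatenating the _ext JSON string is fragile; a sketch that builds the same payload with json.dumps instead (same keys as in the code above):

# Build _ext with json.dumps rather than string concatenation
detail_data["_ext"] = json.dumps({
    "query": {
        "id": recipe_info['ingredients_id'],
        "kw": recipe_info['main_ingredient'],
        "idx": "4",
        "src": "2803",
        "type": "13",
    }
}, ensure_ascii=False)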

spider_douguo_1.py output: [screenshot]


4-7 Writing the crawler, part 5: writing the data to the database

Task: save the data to MongoDB

Problems encountered

  • ModuleNotFoundError: No module named 'pip'. Fix: run the following in order:
    • python -m ensurepip
    • python -m pip install --upgrade pip
    • Reference: https://blog.csdn.net/haihonga/article/details/100168691
  • ERROR: Could not find a version that satisfies the requirement pip==22.0.4 / ERROR: No matching distribution found for pip==22.0.4. Fix: close Fiddler (its proxy intercepts pip's traffic).
    • Reference: https://blog.csdn.net/weixin_44917577/article/details/118309074
  • Installing MongoDB and MongoDB Compass
    • Reference: https://www.cnblogs.com/daisy-fung1314/p/soft-install-note3.html
    • Install paths: D:\Program Files\MongoDB\Server\5.0 and D:\Program Files\MongodbCompass
  • Adding a MongoDB administrator
    • Reference: https://blog.csdn.net/xiaoxiangzi520/article/details/81094378
    • Username: admin, password: 123456
  • TypeError: 'Collection' object is not callable. If you meant to call the 'insert' method on a 'Collection' object it is failing because no such method exists.
    • Cause: pymongo 4 removed Collection.insert() and several other legacy methods (a pymongo 4 alternative is sketched after the handle_mongo listing below).
    • Fix: pip uninstall pymongo (remove the new package), then pip install --index-url https://pypi.tuna.tsinghua.edu.cn/simple/ pymongo==3.0.3 (install the older version).
  • pymongo.errors.ServerSelectionTimeoutError: 10.62.73.160:27017: [WinError 10061] No connection could be made because the target machine actively refused it.
    • Cause: the local MongoDB service is not running; see:
    • https://www.cnblogs.com/wjaaron/p/7800490.html
    • https://blog.csdn.net/weixin_38752101/article/details/119927477
    • https://www.imooc.com/qadetail/306841 (set the host to 127.0.0.1)
# -*- coding: utf-8 -*-
# @Author : 袁天琪
# @Time : 2022/3/15 16:01

# Imports
import pymongo
from pymongo.collection import Collection

# MongoDB connection helper
class Connect_mongo(object):
    def __init__(self):
        self.client = pymongo.MongoClient(host='127.0.0.1',port=27017)
        self.db_data = self.client['douguo']  # database name

    def insert_item(self,item):
        # Collection.insert() exists only in pymongo 3.x (removed in pymongo 4; see the sketch below)
        db_collection = Collection(self.db_data,'douhuo_item')
        db_collection.insert(item)

# Module-level instance imported by the spider scripts
mongo_info = Connect_mongo()
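
The insert() call above is why the troubleshooting list pins pymongo to 3.0.3. If you stay on pymongo 4.x instead, the equivalent method is insert_one; a minimal sketch of how insert_item could look:

    # pymongo 4.x alternative: index into the database and call insert_one()
    def insert_item(self, item):
        self.db_data['douhuo_item'].insert_one(item)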
# -*- coding: utf-8 -*-
# @Author : 袁天琪
# @Time : 2022/3/14 21:18
# Save the scraped data to MongoDB

# Imports
import json
from multiprocessing import Queue
import requests
from handle_mongo import mongo_info

# Create the task queue
queue_list = Queue()

# Request helper: all three endpoints share the same headers, so only url and data vary
def handle_request(url,data):
    # Headers copied from Fiddler (editor regex replace (.*?):(.*) → "$1":"$2",)
    header = {
        "client":"4",
        "version":"7109.2",
        "device":"M2007J22C",
        "sdk":"25,7.1.2",
        "imei":"351564354264020",
        "channel":"qqkp",
        "resolution":"1600*900",
        "dpi":"2.0",
        "brand":"Xiaomi",
        "scale":"2.0",
        "timezone":"28800",
        "language":"zh",
        "cns":"2",
        # "carrier":"CMCC", # 没有这项数据
        "User-Agent":"Mozilla/5.0(Linux;Android 7.1.2;M2007J22C Build/QP1A.190711.020;wv) AppleWebKit/537.36(KHTML,like Gecko) Version/4.0 Chrome/92.0.4515.131 Mobile Safari/537.36",
        "reach":"1",
        "newbie":"1",
        "Content-Type":"application/x-www-form-urlencoded; charset=utf-8",
        "Accept-Encoding":"gzip, deflate",
        "Connection":"Keep-Alive",
        "Host":"logs.douguo.net",
    }
    # Send the request (all three endpoints use POST)
    response = requests.post(url=url,headers=header,data=data)
    return response

# Fetch the category index and enqueue one search payload per dish
def handle_index():
    # Request body copied from Fiddler: replace & with newlines and quote each field
    url = "http://api.douguo.net/recipe/flatcatalogs"
    data = {
        "client":"4",
        "_vs" : "2305",
    }
    response = handle_request(url=url,data=data)
    # Parse the JSON response
    catalog_response_dict = json.loads(response.text)
    for catalog_list in catalog_response_dict['result']['cs']:
        for catalog in catalog_list['cs']:
            for dishes in catalog['cs']:
                data2 = {
                    "client": "4",
                    "keyword": dishes['name'],
                    "order": "3",
                    "_vs": "400",
                }
                # Enqueue the search payload
                queue_list.put(data2)

def handle_recipe_list(data):
    print("当前处理的食材是:",data['keyword'])
    recipe_list_url='http://api.douguo.net/recipe/v2/search/0/20'
    recipe_list_response=handle_request(url=recipe_list_url,data=data)
    recipe_list_response_dict = json.loads(recipe_list_response.text)
    for recipe_list in recipe_list_response_dict['result']['list']:
        recipe_info = {}
        recipe_info['main_ingredient'] = data['keyword']
        if recipe_list['type'] == 13:
            recipe_info['username'] = recipe_list['r']['an']
            recipe_info['ingredients_id'] =recipe_list['r']['id']
            recipe_info['describe'] =recipe_list['r']['cookstory'].replace('\n','').replace(' ','')
            recipe_info['dishname'] =recipe_list['r']['n']
            recipe_info['ingredients'] =recipe_list['r']['major']
            detail_url = 'http://api.douguo.net/recipe/detail/'+str(recipe_info['ingredients_id']) 
            # Build the detail-request payload
            detail_data = {
                "client": "4",
                "author_id":"0",
                "_vs": "2803",
                "_ext": '{"query":{"id":'+str(recipe_info['ingredients_id'])+',"kw":'+recipe_info['main_ingredient']+',"idx":"4";"src":"2803";"type":"13"}}',
            }
            # Fetch the recipe detail
            detail_response = handle_request(url=detail_url,data=detail_data)
            detail_response_dict = json.loads(detail_response.text)
            recipe_info['tips'] = detail_response_dict['result']['recipe']['tips']
            recipe_info['cookstep'] = detail_response_dict['result']['recipe']['cookstep']
            print('Inserting recipe:', recipe_info['dishname'])
            mongo_info.insert_item(recipe_info)
        else:
            continue


handle_index()
handle_recipe_list(queue_list.get())
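
To confirm the data actually landed, a quick check against the local MongoDB (pymongo 3.x, using the same database and collection names as handle_mongo.py):

import pymongo

client = pymongo.MongoClient(host='127.0.0.1', port=27017)
print(client['douguo']['douhuo_item'].find_one())  # prints one stored recipe document, or None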

spider_douguo_2.py output: [screenshot]


4-8 Writing the crawler, part 6: multithreading logic

Task: fetch the data with a Python thread pool

# -*- coding: utf-8 -*-
# @Author : 袁天琪
# @Time : 2022/3/14 21:18
# Fetch the data with a thread pool

# Imports
import json
from multiprocessing import Queue
import requests
from handle_mongo import mongo_info
from concurrent.futures import  ThreadPoolExecutor

# Create the task queue
queue_list = Queue()

# Request helper: all three endpoints share the same headers, so only url and data vary
def handle_request(url,data):
    # Headers copied from Fiddler (editor regex replace (.*?):(.*) → "$1":"$2",)
    header = {
        "client":"4",
        "version":"7109.2",
        "device":"M2007J22C",
        "sdk":"25,7.1.2",
        "imei":"351564354264020",
        "channel":"qqkp",
        "resolution":"1600*900",
        "dpi":"2.0",
        "brand":"Xiaomi",
        "scale":"2.0",
        "timezone":"28800",
        "language":"zh",
        "cns":"2",
        # "carrier":"CMCC", # 没有这项数据
        "User-Agent":"Mozilla/5.0(Linux;Android 7.1.2;M2007J22C Build/QP1A.190711.020;wv) AppleWebKit/537.36(KHTML,like Gecko) Version/4.0 Chrome/92.0.4515.131 Mobile Safari/537.36",
        "reach":"1",
        "newbie":"1",
        "Content-Type":"application/x-www-form-urlencoded; charset=utf-8",
        "Accept-Encoding":"gzip, deflate",
        "Connection":"Keep-Alive",
        "Host":"logs.douguo.net",
    }
    # Send the request (all three endpoints use POST)
    response = requests.post(url=url,headers=header,data=data)
    return response

# Fetch the category index and enqueue one search payload per dish
def handle_index():
    # Request body copied from Fiddler: replace & with newlines and quote each field
    url = "http://api.douguo.net/recipe/flatcatalogs"
    data = {
        "client":"4",
        "_vs" : "2305",
    }
    response = handle_request(url=url,data=data)
    # Parse the JSON response
    catalog_response_dict = json.loads(response.text)
    for catalog_list in catalog_response_dict['result']['cs']:
        for catalog in catalog_list['cs']:
            for dishes in catalog['cs']:
                data2 = {
                    "client": "4",
                    "keyword": dishes['name'],
                    "order": "3",
                    "_vs": "400",
                }
                # Enqueue the search payload
                queue_list.put(data2)
# Worker function for the thread pool: processes one payload taken off the queue,
# requesting the recipe list page and then each recipe's detail page
def handle_recipe_list(data):
    print("当前处理的食材是:",data['keyword'])
    recipe_list_url='http://api.douguo.net/recipe/v2/search/0/20'
    # First request: the recipe list
    recipe_list_response=handle_request(url=recipe_list_url,data=data)
    recipe_list_response_dict = json.loads(recipe_list_response.text)
    for recipe_list in recipe_list_response_dict['result']['list']:
        recipe_info = {}
        recipe_info['main_ingredient'] = data['keyword']
        if recipe_list['type'] == 13:
            recipe_info['username'] = recipe_list['r']['an']
            recipe_info['ingredients_id'] =recipe_list['r']['id']
            recipe_info['describe'] =recipe_list['r']['cookstory'].replace('\n','').replace(' ','')
            recipe_info['dishname'] =recipe_list['r']['n']
            recipe_info['ingredients'] =recipe_list['r']['major']
            detail_url = 'http://api.douguo.net/recipe/detail/'+str(recipe_info['ingredients_id'])
            # Build the detail-request payload
            detail_data = {
                "client": "4",
                "author_id":"0",
                "_vs": "2803",
                "_ext": '{"query":{"id":'+str(recipe_info['ingredients_id'])+',"kw":'+recipe_info['main_ingredient']+',"idx":"4";"src":"2803";"type":"13"}}',
            }
            # Second request: the recipe detail
            detail_response = handle_request(url=detail_url,data=detail_data)
            detail_response_dict = json.loads(detail_response.text)
            recipe_info['tips'] = detail_response_dict['result']['recipe']['tips']
            recipe_info['cookstep'] = detail_response_dict['result']['recipe']['cookstep']
            print('Inserting recipe:', recipe_info['dishname'])
            # Save to MongoDB
            mongo_info.insert_item(recipe_info)
        else:
            continue


handle_index()
# Multithreaded fetching via a thread pool
pool = ThreadPoolExecutor(max_workers=20)
while queue_list.qsize()>0:
    pool.submit(handle_recipe_list,queue_list.get())
# handle_recipe_list(queue_list.get())
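
One caveat: the qsize() loop only submits work, so once the queue drains the main thread falls through immediately. Calling shutdown(wait=True) makes the wait for in-flight tasks explicit, which matters if anything should run after the crawl; a minimal variant:

pool = ThreadPoolExecutor(max_workers=20)
while queue_list.qsize() > 0:
    pool.submit(handle_recipe_list, queue_list.get())
pool.shutdown(wait=True)  # block until every submitted task has finished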

spider_douguo_3.py output: [screenshot]


4-9 Writing the crawler, part 7: disguising the crawler with proxy logic

Task: hide the crawler behind a proxy IP

# -*- coding: utf-8 -*-
# @Author : 袁天琪
# @Time : 2022/3/14 21:18
# Hide the crawler behind a proxy IP (extends the thread-pool version)

# Imports
import json
from multiprocessing import Queue
import requests
from handle_mongo import mongo_info
from concurrent.futures import  ThreadPoolExecutor

# Create the task queue
queue_list = Queue()

# Request helper: all three endpoints share the same headers, so only url and data vary
def handle_request(url,data):
    # Headers copied from Fiddler (editor regex replace (.*?):(.*) → "$1":"$2",)
    header = {
        "client":"4",
        "version":"7109.2",
        "device":"M2007J22C",
        "sdk":"25,7.1.2",
        "imei":"351564354264020",
        "channel":"qqkp",
        "resolution":"1600*900",
        "dpi":"2.0",
        "brand":"Xiaomi",
        "scale":"2.0",
        "timezone":"28800",
        "language":"zh",
        "cns":"2",
        # "carrier":"CMCC", # 没有这项数据
        "User-Agent":"Mozilla/5.0(Linux;Android 7.1.2;M2007J22C Build/QP1A.190711.020;wv) AppleWebKit/537.36(KHTML,like Gecko) Version/4.0 Chrome/92.0.4515.131 Mobile Safari/537.36",
        "reach":"1",
        "newbie":"1",
        "Content-Type":"application/x-www-form-urlencoded; charset=utf-8",
        "Accept-Encoding":"gzip, deflate",
        "Connection":"Keep-Alive",
        "Host":"logs.douguo.net",
    }
    # Set the proxy IP (sample address; replace with a live proxy)
    proxy = {'http': '106.54.128.253:999'}
    # Send the request through the proxy (all three endpoints use POST)
    response = requests.post(url=url,headers=header,data=data,proxies=proxy)
    return response

# Fetch the category index and enqueue one search payload per dish
def handle_index():
    # Request body copied from Fiddler: replace & with newlines and quote each field
    url = "http://api.douguo.net/recipe/flatcatalogs"
    data = {
        "client":"4",
        "_vs" : "2305",
    }
    response = handle_request(url=url,data=data)
    # Parse the JSON response
    catalog_response_dict = json.loads(response.text)
    for catalog_list in catalog_response_dict['result']['cs']:
        for catalog in catalog_list['cs']:
            for dishes in catalog['cs']:
                data2 = {
                    "client": "4",
                    "keyword": dishes['name'],
                    "order": "3",
                    "_vs": "400",
                }
                # Enqueue the search payload
                queue_list.put(data2)
# Worker function for the thread pool: processes one payload taken off the queue,
# requesting the recipe list page and then each recipe's detail page
def handle_recipe_list(data):
    print("当前处理的食材是:",data['keyword'])
    recipe_list_url='http://api.douguo.net/recipe/v2/search/0/20'
    # First request: the recipe list
    recipe_list_response=handle_request(url=recipe_list_url,data=data)
    recipe_list_response_dict = json.loads(recipe_list_response.text)
    for recipe_list in recipe_list_response_dict['result']['list']:
        recipe_info = {}
        recipe_info['main_ingredient'] = data['keyword']
        if recipe_list['type'] == 13:
            recipe_info['username'] = recipe_list['r']['an']
            recipe_info['ingredients_id'] =recipe_list['r']['id']
            recipe_info['describe'] =recipe_list['r']['cookstory'].replace('\n','').replace(' ','')
            recipe_info['dishname'] =recipe_list['r']['n']
            recipe_info['ingredients'] =recipe_list['r']['major']
            detail_url = 'http://api.douguo.net/recipe/detail/'+str(recipe_info['ingredients_id'])
            # Build the detail-request payload
            detail_data = {
                "client": "4",
                "author_id":"0",
                "_vs": "2803",
                "_ext": '{"query":{"id":'+str(recipe_info['ingredients_id'])+',"kw":'+recipe_info['main_ingredient']+',"idx":"4";"src":"2803";"type":"13"}}',
            }
            # Second request: the recipe detail
            detail_response = handle_request(url=detail_url,data=detail_data)
            detail_response_dict = json.loads(detail_response.text)
            recipe_info['tips'] = detail_response_dict['result']['recipe']['tips']
            recipe_info['cookstep'] = detail_response_dict['result']['recipe']['cookstep']
            print('Inserting recipe:', recipe_info['dishname'])
            # Save to MongoDB
            mongo_info.insert_item(recipe_info)
        else:
            continue


handle_index()
# Multithreaded fetching via a thread pool
pool = ThreadPoolExecutor(max_workers=2)
while queue_list.qsize()>0:
    pool.submit(handle_recipe_list,queue_list.get())
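
The proxy address above is only a sample and may well be dead by the time you try it; to check whether requests is actually routing through your proxy, a quick sketch (httpbin.org echoes the caller's IP):

import requests

proxy = {'http': 'http://106.54.128.253:999'}  # replace with a live proxy
print(requests.get('http://httpbin.org/ip', proxies=proxy, timeout=10).text)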

spider_douguo_4.py output: [screenshot]

4-10 Chapter 4 crawler summary
