赞
踩
在本文中,我们将介绍如何使用Django开发一个简单但功能强大的爬虫系统。我们将使用Python编写爬虫,并将爬取到的数据存储到Django模型中,然后通过Django的管理页面管理这些数据。
爬虫系统用于从互联网上收集信息,常用于数据分析、搜索引擎等领域。我们选择使用Django作为开发框架,因为它提供了强大的数据库管理和Web开发功能,能够帮助我们快速搭建一个稳健的爬虫系统。
首先确保您已经安装了Python,并使用pip安装了Django和其他必要的库:
pip install django requests beautifulsoup4
使用Django命令行工具创建一个新的项目:
- django-admin startproject mycrawler
- cd mycrawler
在models.py
文件中定义一个模型来存储爬取到的数据:
- from django.db import models
-
- class Article(models.Model):
- title = models.CharField(max_length=100)
- content = models.TextField()
- created_at = models.DateTimeField(auto_now_add=True)
-
- def __str__(self):
- return self.title
运行数据库迁移命令以创建数据库表:
- python manage.py makemigrations
- python manage.py migrate
- from django.shortcuts import render
- from django.http import HttpResponse
- from django.http import JsonResponse
- import json
- import bag
- import re
- from bs4 import BeautifulSoup
- from tqdm import tqdm
- import time
- import random
-
-
- def index(request):
- return render(request, 'index.html')
-
-
- def get_url(city_name, item):
- session = bag.session.create_session()
- for cookie in bag.Bag.read_json(r'./static/cookies.json'):
- session.cookies.set(cookie['name'], cookie['value'])
- session.headers['Connection'] = 'close'
- js_data = bag.Bag.read_json('./static/city.json')
- judge = js_data.get(city_name)
- if judge:
- city = judge.split('/')[-1]
- js_data1 = bag.Bag.read_json('./static/menu.json')
- judge1 = js_data1.get(item)
- if judge1:
- return judge1.format(city), session
-
-
- def get_shops(request):
- if request.method == 'POST':
- try:
- data = json.loads(request.body)
- if not bool(data.get('city')) and not bool(data.get('shopLink')):
- return JsonResponse({'message': '请选择城市'}, status=200)
- elif not bool(data.get('city')) and bool(data.get('shopLink')): # 处理传入具体链接后续开发
- pass
- else:
- url, session = get_url(data.get('city'), data.get('selectedType'))
-
- def get_types(): # 正常传参
- pattern = re.compile(r'<a.*?href="(.*?)".*?<span>(.*?)</span></a>', re.S)
- if bool(url):
- resp = session.get(url)
- html = BeautifulSoup(resp.text, 'html.parser')
- soup = html.findAll('div', id='classfy')
- links = re.findall(pattern, str(soup))
- return links
- else:
- return None
-
- def get_shop():
- links = get_types()
- pattern = re.compile(r'<div class="tit">.*?<a.*?data-shopid="(.*?)".*?href="(.*?)".*?title="(.*?)"'
- r'(?:.*?<div class="star_icon">.*?<span class="(.*?)"></span>.*?<b>(.*?)</b>)?'
- r'(?:.*?<b>(.*?)</b>)?'
- r'(?:.*?<div class="tag-addr">.*?<span class="tag">(.*?)</span>.*?<em class="sep">.*?<span class="tag">(.*?)</span>)?',
- re.S)
- number = re.compile(r'data-ga-page="(.*?)"', re.S)
-
- result = []
-
- print(len(links))
- for link in links[11:12]: # 获取第一页
- resp = session.get(link[0])
- time.sleep(random.randint(1, 10))
- page = [int(i) for i in re.findall(number, resp.text)]
- page_num = sorted(page, reverse=True)[0]
- html = BeautifulSoup(resp.text, 'html.parser')
-
- soup = html.findAll('li', class_='')
- for i in soup:
- for j in re.findall(pattern, str(i)):
- result.append(j)
- if page_num >= 2: # 获取第一页往后
- for count in tqdm(range(page_num)):
- try:
- resp1 = session.get(link[0] + 'p{}'.format(count + 1))
- time.sleep(random.randint(1, 5))
- # time.sleep(0.5)
- html1 = BeautifulSoup(resp1.text, 'html.parser')
- soup1 = html1.findAll('li', class_='')
- for k in soup1:
- info = pattern.search(str(k))
- if info:
- groups = list(info.groups())
- for i in range(len(groups)):
- if not groups[i]:
- groups[i] = 'null'
- result.append(tuple(groups))
- except Exception as e:
- print(e)
- continue
- else:
- pass
- return result
-
- end = get_shop()
- bag.Bag.save_excel(end, f'./{data.get("city")}_{data.get("selectedType")}.xlsx')
-
- return JsonResponse({'message': 'Data received successfully'}, status=200)
- except json.JSONDecodeError as e:
- return JsonResponse({'error': 'Invalid JSON format'}, status=400)
- else:
- return JsonResponse({'error': 'Only POST requests are allowed'}, status=405)
index.html
- <!DOCTYPE html>
- <html>
- <head>
- <meta charset="utf-8" />
- <title>首页</title>
- <link rel="stylesheet" href="../static/css/page.css" />
- <script type="text/javascript" src="../static/js/jquery.min.js" ></script>
- <script type="text/javascript" src="../static/js/index.js" ></script>
- </head>
- <body>
- <div class="left">
- <div class="bigTitle">爬虫管理系统</div>
- <div class="lines">
- <div onclick="pageClick(this)" data-html="/text.html" data-title="文本爬虫"><img src="../static/img/icon-1.png" />文本爬虫</div>
- <div onclick="pageClick(this)" data-html="/image.html" data-title="图片爬虫"><img src="../static/img/icon-2.png" />图片爬虫</div>
- <div onclick="pageClick(this)" data-html="/music.html" data-title="音乐爬虫"><img src="../static/img/icon-3.png" />音乐爬虫</div>
- <div onclick="pageClick(this)" data-html="/video.html" data-title="视频爬虫"><img src="../static/img/icon-4.png" />视频爬虫</div>
- </div>
- </div>
- <div class="top">
- <div class="leftTiyle" id="flTitle">基于python实现的网络爬虫</div>
- <div class="thisUser">当前用户:互联网民工</div>
- </div>
- <div class="content" id="contentDiv"></div>
- <script>
- function pageClick(element) {
- var clickedElement = $(element);
- $('#contentDiv').empty();
- var htmlPath = clickedElement.attr('data-html');
- var pageTitle = clickedElement.attr('data-title');
-
- $.ajax({
- url: htmlPath,
- dataType: 'html',
- success: function (data) {
- $('#contentDiv').html(data);
- $('#flTitle').text(pageTitle); // 设置标题
- },
- error: function (xhr, status, error) {
- // 处理加载失败的情况
- }
- });
- }
-
- $(document).ready(function() {
- var defaultContent = "欢迎使用爬虫管理系统!请点击左侧功能以查看详细内容。"
- $('#contentDiv').html(defaultContent);
- });
-
- </script>
- </body>
- </html>
将Django应用部署到生产环境时,请确保使用安全的方法,并考虑使用Docker进行容器化部署,使用Nginx和Gunicorn进行Web服务器和应用服务器的配置等。
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。