Data Visualization in Three Steps (Part 1): Data Collection and Storage, Crawling Web Data with the Python Scrapy Framework and Storing It

Author: 羊村懒王 | 2024-06-09 22:17:25

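Preface

I have been studying Python crawlers lately and suddenly felt like blogging, so here goes. My small goal is a complete yet simple data visualization project that strings the whole technology chain together, so that I end up with a systematic understanding of it. The design is as follows:

1. Crawl web data with the Python crawler framework Scrapy and store it in MySQL;
2. Use Spring Boot and MyBatis as the web back-end service;
3. Use the Thymeleaf template engine plus ECharts for the data visualization.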

This part covers step 1.

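1. Setting up the Python environment on Windows

Download and install Python 3.6.4. Note: for convenience, be sure to install the pip module and add Python to the PATH environment variable.
Run python on the command line to check that the installation succeeded.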

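2. Creating a virtual environment with virtualenv and installing the Scrapy framework

Install virtualenv: pip install virtualenv
Create a folder named PythonENV (any name will do; it is the directory that holds the virtual environments)
Create the virtual environment env22: cd into PythonENV on the command line and run virtualenv env22
Activate env22: go into env22\Scripts\ and run activate
Once activated, the prompt shows the (env22) prefix, which means activation succeeded and you are now inside the virtual environment env22. Next, install Scrapy inside env22 by running: pip install Scrapy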
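
Besides looking for the (env22) prefix, the interpreter itself can report whether it belongs to a virtual environment. The following is a minimal sketch, not part of the original setup; the function name in_virtual_env is made up for illustration.

```python
import sys


def in_virtual_env():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; newer virtualenv and the
    # standard-library venv make sys.base_prefix differ from sys.prefix instead.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or base_prefix != sys.prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("inside a virtual environment:", in_virtual_env())
```

Running this with the python from env22\Scripts should report that it is inside a virtual environment, while the system-wide python should not.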

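I ran into a problem here:

Install the win32api module: pip install pywin32
The following DLLs also need to be copied into System32:
Create the Scrapy project: scrapy startproject mydemo
Import the newly created project into PyCharm; its structure is as follows: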

`At this point the environment setup and the Scrapy project skeleton are basically done!`

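3. Writing the spider to crawl Douban data

1. In items.py, define the Douban item class that abstracts and encapsulates the data:

import scrapy


class DouBanItem(scrapy.Item):
    # Fields of one entry in the Douban Top 250 list
    movie_title = scrapy.Field()     # movie title
    movie_score = scrapy.Field()     # movie score
    movie_eval_num = scrapy.Field()  # number of ratings
    movie_quote = scrapy.Field()     # short quote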

2. Write the spider that crawls the Douban data:

# encoding:utf-8

from scrapy import Request, Spider
from mydemo.items import DouBanItem


class DouBanSpider(Spider):
    name = "douban"
    allowed_domains = ["movie.douban.com"]
    start_urls = ["https://movie.douban.com/top250"]
    #
    # Unused CrawlSpider-style alternative. It would need CrawlSpider, Rule and
    # LinkExtractor, and a callback that is not named parse.
    # rules = (
    #     # Add every URL matching the regular expression to the crawl queue
    #     Rule(LinkExtractor(allow=(r'https://movie\.douban\.com/top250\?start=\d+&filter=&type=',))),
    #     # Download every matching URL and pass the response to the custom callback;
    #     # these are the detail pages of the movies on each list page
    #     Rule(LinkExtractor(allow=(r'https://movie\.douban\.com/subject/\d+',)), callback='parse_page', follow=True),
    # )

    def parse(self, response):
        movies = response.xpath('//ol[@class="grid_view"]/li')

        for m in movies:
            # Create a fresh item per movie so yielded items do not share state
            item = DouBanItem()
            # Movie title
            item['movie_title'] = m.xpath('div/div[2]/div[1]/a/span[1]/text()').extract()[0]
            # Movie score
            item['movie_score'] = m.xpath('div/div[2]/div[2]/div/span[2]/text()').extract()[0]
            # Number of ratings; [:-3] drops the trailing "人评价"
            item['movie_eval_num'] = m.xpath('div/div[2]/div[2]/div/span[4]/text()').extract()[0][:-3]
            # movie_eval_num = re.findall(r'\d+', movie_eval)[-1]  # a regex works as well as the slice
            # Short quote; it may be missing, so fall back to an empty string
            item['movie_quote'] = m.xpath('div/div[2]/div[2]/p[2]/span/text()').extract_first(default='')

            yield item

        # Queue pages 2 to 10. Every page yields the same nine URLs, and
        # Scrapy's duplicate filter drops the requests it has already seen.
        for p in range(9):
            url_ = "https://movie.douban.com/top250?start={}&filter=".format((p + 1) * 25)
            yield Request(url_, callback=self.parse)
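
The two bits of arithmetic in the spider, the page offsets and the rating count, can be tried out without Scrapy. The sketch below uses only the standard library; the helper names page_urls and parse_eval_num are made up for illustration, and the regular expression is the alternative that the commented-out line in the spider hints at.

```python
import re

BASE_URL = "https://movie.douban.com/top250"
PAGE_SIZE = 25  # the spider steps the start parameter by 25 per page


def page_urls(pages=10):
    """Build the list-page URLs: start=0, 25, ..., 25 * (pages - 1)."""
    return ["{}?start={}&filter=".format(BASE_URL, i * PAGE_SIZE) for i in range(pages)]


def parse_eval_num(text):
    """Pull the digits out of a rating-count string such as '1234567人评价'."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else 0


if __name__ == "__main__":
    for url in page_urls():
        print(url)
    print(parse_eval_num("1234567人评价"))
```

Unlike the [:-3] slice, the regular expression does not depend on the suffix being exactly three characters long. Assigning page_urls() to start_urls would presumably be another way to cover all ten pages without yielding follow-up requests from parse.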






3. Modify settings.py: add a user agent and disable the robots.txt check to avoid being blocked

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# Scrapy expects USER_AGENT to be a single string, not a list
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0'
# Alternative value:
# USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
#               'Chrome/42.0.2311.135 Safari/537.36 Edge/12.10240')

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
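
If the aim is to vary the user agent between requests, one option is a small downloader middleware. The following is a minimal sketch under stated assumptions: the names USER_AGENT_LIST and RandomUserAgentMiddleware are hypothetical, and the class is assumed to live in mydemo/middlewares.py. It relies only on Scrapy's process_request hook, so it imports nothing from Scrapy.

```python
import random

# Hypothetical list; the two strings are the user agents that appear in this post.
USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/42.0.2311.135 Safari/537.36 Edge/12.10240",
]


class RandomUserAgentMiddleware(object):
    """Hypothetical downloader middleware: one random User-Agent per request."""

    def __init__(self, user_agents=None):
        self.user_agents = list(user_agents or USER_AGENT_LIST)

    def process_request(self, request, spider):
        # Overwrite the header; returning None lets Scrapy continue with the request.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None
```

To try it, the class would be enabled in settings.py with something like DOWNLOADER_MIDDLEWARES = {'mydemo.middlewares.RandomUserAgentMiddleware': 543}, where both the module path and the priority are assumptions.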

4. Data storage

1. Configure the MySQL data source (this is my local MySQL):

# db configure
MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'python'
MYSQL_USER = 'root'
MYSQL_PASSWD = 'root'
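
settings.py is also where the MySqlPipeline class from the next step has to be switched on, otherwise Scrapy never calls it. A minimal fragment, assuming the class is saved in mydemo/pipelines.py (the file that scrapy startproject generates for pipelines):

```python
# settings.py (fragment)
# Assumption: MySqlPipeline lives in mydemo/pipelines.py.
ITEM_PIPELINES = {
    "mydemo.pipelines.MySqlPipeline": 300,  # 0-1000; lower values run earlier
}
```

The number only matters when several pipelines are enabled, because it decides the order in which items pass through them.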

2. Write the pipeline that processes the items returned by the crawl:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import logging

import pymysql
from mydemo import settings

logger = logging.getLogger(__name__)


class MySqlPipeline(object):
    # Open the database connection
    def __init__(self):
        self.connect = pymysql.connect(
            host=settings.MYSQL_HOST,
            db=settings.MYSQL_DBNAME,
            user=settings.MYSQL_USER,
            passwd=settings.MYSQL_PASSWD,
            charset='utf8',
            use_unicode=True
        )
        # Inserts, deletes, queries and updates all go through the cursor
        self.cursor = self.connect.cursor()

    # Handle each item returned by the spider
    def process_item(self, item, spider):
        try:
            # Duplicate check: look for an existing row with the same title
            self.cursor.execute(
                """select * from t_movie where title = %s""",
                (item['movie_title'],))
            repetition = self.cursor.fetchone()

            if not repetition:
                # Insert the new row
                self.cursor.execute(
                    """insert into t_movie (title,score,eval_num,m_quote)
                    values (%s, %s, %s, %s)""",
                    (item['movie_title'],
                     item['movie_score'],
                     item['movie_eval_num'],
                     item['movie_quote']))

                # Commit the transaction
                self.connect.commit()

        except Exception as error:
            # Log the error and keep the crawl running
            logger.error(error)

        return item

    # Close the cursor and the connection when the spider finishes
    def close_spider(self, spider):
        self.cursor.close()
        self.connect.close()
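
The check-then-insert logic of the pipeline can be tried without a MySQL server. The sketch below swaps in SQLite from the standard library purely so that it runs anywhere; it is not the pipeline itself. The column names come from the insert statement above, while the column types, the id column and the helper name save_movie are assumptions, since the post does not show the t_movie table definition.

```python
import sqlite3

# Assumed schema: column names follow the pipeline's INSERT statement,
# the types and the id column are guesses.
SCHEMA = """
CREATE TABLE IF NOT EXISTS t_movie (
    id       INTEGER PRIMARY KEY,
    title    TEXT NOT NULL,
    score    TEXT,
    eval_num TEXT,
    m_quote  TEXT
)
"""


def save_movie(conn, item):
    """Insert item unless a row with the same title exists; return True if inserted."""
    cur = conn.cursor()
    # sqlite3 uses "?" placeholders where pymysql uses "%s".
    cur.execute("SELECT 1 FROM t_movie WHERE title = ?", (item["movie_title"],))
    if cur.fetchone():
        return False
    cur.execute(
        "INSERT INTO t_movie (title, score, eval_num, m_quote) VALUES (?, ?, ?, ?)",
        (item["movie_title"], item["movie_score"],
         item["movie_eval_num"], item["movie_quote"]),
    )
    conn.commit()
    return True


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    # Placeholder values for illustration only, not scraped data
    movie = {"movie_title": "demo title", "movie_score": "9.0",
             "movie_eval_num": "100", "movie_quote": ""}
    print(save_movie(conn, movie))  # first call inserts the row
    print(save_movie(conn, movie))  # second call finds the duplicate
    conn.close()
```

A unique index on title would be another way to prevent duplicates; the select-first approach shown here simply mirrors what the pipeline does.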
