Crawling the official Vue 2 documentation with Python (can a Python crawler scrape Vue 2 pages?)

Author: IT小白 | 2024-02-07 17:27:39

Vue offline documentation download

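This is an offline copy of the Chinese Vue 2 documentation, crawled from the official site by a crawler program. It covers the guide, the API reference, the examples, the style guide, and a few other sections. Download link: Vue 2 offline documentation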

Runnable source program and notes

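For the program to run correctly, the folders must be created in advance according to the directory layout below. This hierarchy mirrors the directory structure of the source site, which can be inspected in the browser's developer tools.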
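The original directory listing was a screenshot and is not reproduced here. As a minimal sketch, assuming the layout implied by the main program below (one folder per section under v2/, plus top-level folders for static files), the folders could be created as follows. The helper name make_layout and the static folder names css, js, and images are assumptions for illustration; check the real paths in the developer tools.

```python
import os
import tempfile

# Sections crawled by the main program below.
TARGETS = ['style-guide', 'api', 'cookbook', 'examples', 'guide']
# Assumed top-level folders for static files; verify against the real site.
STATIC_DIRS = ['css', 'js', 'images']


def make_layout(base_dir):
    """Create base_dir/v2/<section> and base_dir/<static folder>."""
    paths = [os.path.join(base_dir, 'v2', target) for target in TARGETS]
    paths += [os.path.join(base_dir, name) for name in STATIC_DIRS]
    for path in paths:
        # exist_ok=True makes the call safe to repeat.
        os.makedirs(path, exist_ok=True)
    return paths


if __name__ == '__main__':
    # Demonstrate in a throwaway directory instead of the real base_dir.
    with tempfile.TemporaryDirectory() as tmp:
        for path in make_layout(tmp):
            print(path)
```

In practice, base_dir would be the same path that the main program uses. Static files stored in deeper subfolders on the site would need those subfolders as well.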

Main program: vue_crawl.py

import re
import time

import requests


class VueCrawl:
    headers = {
        'Referer': 'https://vuejs.bootcss.com/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
    }
    # Site root
    base_url = 'https://vuejs.bootcss.com'
    # Root index URL of the v2 documentation
    index_url = 'https://vuejs.bootcss.com/v2/'
    # Sections to crawl
    targets = ['style-guide', 'api', 'cookbook', 'examples', 'guide']
    # Local root directory where the documentation is saved
    base_dir = 'D:/code/python/vue_crawl/vue_files'

    # Regex for extracting page links (href values containing '#' are skipped)
    url_pattern = re.compile(r"<a\s+[^>]*href=\"([^#>\"]*)\"[^>]*>([^<]*)</a>")
    # Regex for extracting stylesheet links
    css_pattern = re.compile(r"<link\s+[^>]*stylesheet[^>]*\s+href=\"([^#>\"]*)\"[^>]*>")
    # Regex for extracting external scripts
    js_pattern = re.compile(r"<script\s+[^>]*src=\"([^>\"]*)\"[^>]*>\s*</script>")
    # Regex for extracting images
    img_pattern = re.compile(r"<img\s+[^>]*src=\"([^>\"]*)\"[^>]*>")
    # The same static resource may appear on many pages, so collect them in sets
    css_set = set()
    js_set = set()
    img_set = set()
    # Error messages recorded when a download fails
    error_info = []

    @staticmethod
    def download(abspath, content):
        """Save a resource to abspath; content must be bytes."""
        with open(abspath, 'wb') as f:
            f.write(content)

    def fix_pagesurl(self, content):
        """Rewrite links as relative paths so they resolve in the local copy."""
        res_text = content.decode('utf-8', errors='ignore')
        # Iterate over set(...) in each loop below: str.replace() rewrites every
        # occurrence at once, so a duplicate match would be prefixed twice.
        css_search_res = set(self.css_pattern.findall(res_text))
        # CSS links point to the css folder under base_dir
        for item in css_search_res:
            if not item.startswith(('http://', 'https://')):
                self.css_set.add(item)
                res_text = res_text.replace(item, "../.." + item)

        js_search_res = set(self.js_pattern.findall(res_text))
        # JS links point to the js folder under base_dir
        for item in js_search_res:
            if not item.startswith(('http://', 'https://')):
                self.js_set.add(item)
                res_text = res_text.replace(item, "../.." + item)

        img_search_res = set(self.img_pattern.findall(res_text))
        # Image links point to the images folder under base_dir
        for item in img_search_res:
            if not item.startswith(('http://', 'https://')):
                self.img_set.add(item)
                res_text = res_text.replace(item, "../.." + item)

        url_search_res = self.url_pattern.findall(res_text)
        # Documentation links point into the current folder
        for item in url_search_res:
            item_url = item[0]
            if not item_url.startswith(('http://', 'https://')) and (
                    item_url.startswith("/v2/") and item_url.endswith('.html')):
                res_text = res_text.replace(item_url, "./" + item_url.split('/')[-1])
        return res_text.encode('utf-8')

    def crawl_pages(self, target):
        """Crawl one section: 'style-guide', 'api', 'cookbook', 'examples' or 'guide'."""
        url_set = set()
        init_r = requests.get(self.index_url + target, headers=self.headers, timeout=10)
        init_text = init_r.content.decode('utf-8', errors='ignore')

        url_search_res = self.url_pattern.findall(init_text)
        for item in url_search_res:
            item_url = item[0]
            if not item_url.startswith(('http://', 'https://')) and (
                    item_url.startswith("/v2/" + target) and item_url.endswith('.html')):
                url_set.add(item_url)

        # Save the section's index page
        VueCrawl.download(self.base_dir + '/v2/' + target + '/index.html', content=self.fix_pagesurl(init_r.content))

        # Download every link on the index page that passed the checks above
        for item in url_set:
            try:
                filename = item.split('/')[-1]
                res = requests.get(self.base_url + item, headers=self.headers, timeout=10)
                print(str(res.status_code) + ":" + res.url)
                res_content = res.content
                VueCrawl.download(self.base_dir + '/v2/' + target + '/' + filename,
                                  content=self.fix_pagesurl(res_content))
                print('download file %s' % filename)
                # Pause between requests to avoid hammering the server
                time.sleep(2)
            except Exception:
                info = 'download file %s failed' % item
                print(info)
                self.error_info.append(info)

    def download_staticfiles(self):
        """Download the static resources collected while fixing page links."""
        for item in self.js_set | self.img_set | self.css_set:
            try:
                res_content = requests.get(self.base_url + item, headers=self.headers, timeout=10).content
                self.download(self.base_dir + item, res_content)
                time.sleep(1)
            except Exception:
                info = 'download file %s failed' % item
                print(info)
                self.error_info.append(info)

    def main(self):
        """Entry point: crawl the documentation pages first, then the static resources."""
        for target in self.targets:
            self.crawl_pages(target)

        self.download_staticfiles()


if __name__ == '__main__':
    VueCrawl().main()

Analysis of the crawling process

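The content to crawl consists of a handful of top-level sections. They can simply be hard-coded in the program according to their URLs: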

# The top-level sections to crawl
targets = ['style-guide', 'api', 'cookbook', 'examples', 'guide']
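As a small illustration of how these hard-coded names turn into request URLs, the sketch below joins each one onto the v2 index URL, the same way crawl_pages() does with self.index_url + target. The dictionary name section_urls is introduced here only for the example.

```python
# Values copied from the main program.
index_url = 'https://vuejs.bootcss.com/v2/'
targets = ['style-guide', 'api', 'cookbook', 'examples', 'guide']

# Each section's entry page is the index URL plus the section name.
section_urls = {target: index_url + target for target in targets}

for target, url in section_urls.items():
    print(target, '->', url)
```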

Each section in turn lists many individual pages in its left-hand sidebar, and every one of them has to be crawled:

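The page links are extracted with a regular expression. Links containing # are not extracted, because they are anchors within a page and would cause the same page to be crawled repeatedly. Links to other domains are not crawled either; the check is deliberately simple and just tests whether the link starts with http:// or https://: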

# Excerpt from crawl_pages(); url_pattern is the class attribute:
# url_pattern = re.compile(r"<a\s+[^>]*href=\"([^#>\"]*)\"[^>]*>([^<]*)</a>")

url_set = set()
init_r = requests.get(self.index_url + target, headers=self.headers)
init_text = init_r.content.decode('utf-8', errors='ignore')

url_search_res = self.url_pattern.findall(init_text)
for item in url_search_res:
    item_url = item[0]
    if not item_url.startswith(('http://', 'https://')) and (
            item_url.startswith("/v2/" + target) and item_url.endswith('.html')):
        url_set.add(item_url)
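To make the filtering rules concrete, here is a self-contained sketch that runs the same regular expression over a small HTML fragment, with no network access. The fragment and the helper name extract_page_urls are made up for illustration; only url_pattern and the filter conditions come from the program above.

```python
import re

url_pattern = re.compile(r"<a\s+[^>]*href=\"([^#>\"]*)\"[^>]*>([^<]*)</a>")

# Hypothetical HTML fragment, not copied from the real site.
sample_html = '''
<a href="/v2/guide/installation.html">Installation</a>
<a href="/v2/guide/index.html#Getting-Started">Getting Started</a>
<a href="/v2/api/index.html">API</a>
<a href="https://github.com/vuejs/vue">GitHub</a>
<a class="nav" href="/v2/guide/syntax.html">Template Syntax</a>
'''


def extract_page_urls(html, target):
    """Return the sorted, de-duplicated page paths that belong to one section."""
    url_set = set()
    for href, _text in url_pattern.findall(html):
        if href.startswith(('http://', 'https://')):
            continue  # absolute URL: another domain, or at least not a site path
        if href.startswith('/v2/' + target) and href.endswith('.html'):
            url_set.add(href)
    return sorted(url_set)


print(extract_page_urls(sample_html, 'guide'))
# ['/v2/guide/installation.html', '/v2/guide/syntax.html']
```

The anchor link is dropped by the pattern itself, because the capture group cannot contain #. The API and GitHub links are matched by the pattern and then rejected by the filter.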

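After that, the targets are downloaded, which is not covered in detail here. The one thing to watch is that, before a page is saved, the paths of all its resources must be rewritten to match the local directory hierarchy.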
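The main program hard-codes the "../.." prefix, which works because every page is saved exactly two levels below the root (v2/<section>/). As an alternative sketch, not the author's method, the relative path can be computed with posixpath.relpath from the standard library. The helper name to_relative and the example paths are assumptions for illustration.

```python
import posixpath


def to_relative(link, page_path):
    """Rewrite a site-absolute link relative to the folder containing page_path."""
    page_dir = posixpath.dirname(page_path)
    return posixpath.relpath(link, start=page_dir)


page = '/v2/guide/index.html'
print(to_relative('/css/page.css', page))                # ../../css/page.css
print(to_relative('/v2/guide/installation.html', page))  # installation.html
print(to_relative('/v2/api/index.html', page))           # ../api/index.html
```

posixpath is used rather than os.path so that the result always uses forward slashes, as URLs require, even when the script runs on Windows. Unlike the "./" + filename rewrite in the main program, this keeps links to other sections pointing at their own folders.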

Notice: this article was contributed by a site user. Please credit the source when reposting: 【wpsshop博客】
