Python Web Scraping in Practice: Scraping Weibo Hot Search

Author: 小小林熬夜学编程 | 2024-06-12 11:44:28

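Scraping Weibo Hot Search

Preface

Before we start, a few web scraping basics. A Python scraper automates the retrieval of web content: it imitates a browser, fetches a page's source, and extracts the information we need from it. To scrape the Weibo hot search list, we send an HTTP request to get the source, then parse it and pull out the data with regular expressions or a parsing library.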
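
As a minimal illustration of the "fetch, then extract" idea, the sketch below runs a regular expression over a hard-coded HTML snippet rather than a live page. The snippet, the `extract_links` helper, and the example.com URLs are made up for illustration.

```python
import re

# Hypothetical HTML snippet standing in for a downloaded page
SAMPLE_HTML = (
    '<ul>'
    '<li><a href="https://example.com/a">Topic A</a></li>'
    '<li><a href="https://example.com/b">Topic B</a></li>'
    '</ul>'
)

def extract_links(html):
    """Return (url, text) pairs for every <a href="..."> element the regex finds."""
    return re.findall(r'<a href="(.*?)">(.*?)</a>', html)

links = extract_links(SAMPLE_HTML)
for url, text in links:
    print(text, url)
```

A regular expression is enough for small, predictable fragments like this one; for full pages a parsing library is usually the sturdier choice.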

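Scraping target (result preview)

Weibo hot search data comes up often in Python scraping work. The list is a valuable source of information: it shows which events are currently trending and how much attention users are giving them. This article walks through one way to scrape it with Python.

Result preview: [screenshot]

The fields scraped are: title, rank, heat value, news category, timestamp, URL, and so on.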

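Preparation

I'm using Python 3.8 with the VS Code editor. The third-party libraries needed are requests and pymssql; json, re, time, datetime, and threading come from the standard library.

Start by importing the libraries:

import datetime, json, re, time  # standard library: timestamps, JSON parsing, regex, sleeping between requests
from threading import Timer      # schedules the periodic run
import pymssql                   # SQL Server driver used to store the rows
import requests                  # basic HTTP library for scraping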

Create the table:

CREATE TABLE "WB_HotList" (
	"id" INT IDENTITY(1,1) PRIMARY key,
	"batch" NVARCHAR(MAX),
	"daydate" SMALLDATETIME,
	"star_word" NVARCHAR(MAX),
	"title" NVARCHAR(MAX),
	"category" NVARCHAR(MAX),
	"num" NVARCHAR(MAX),
	"subject_querys" NVARCHAR(MAX),
	"flag" NVARCHAR(MAX),
	"icon_desc" NVARCHAR(MAX),
	"raw_hot" NVARCHAR(MAX),
	"mid" NVARCHAR(MAX),
	"emoticon" NVARCHAR(MAX),
	"icon_desc_color" NVARCHAR(MAX),
	"realpos" NVARCHAR(MAX),
	"onboard_time" SMALLDATETIME,
	"topic_flag" NVARCHAR(MAX),
	"ad_info" NVARCHAR(MAX),
	"fun_word" NVARCHAR(MAX),
	"note" NVARCHAR(MAX),
	"rank" NVARCHAR(MAX),
	"url" NVARCHAR(MAX)	
)

So that no column ends up too short for its data, every text column is simply declared as NVARCHAR(MAX). [screenshot]

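Code walkthrough

I'll lay out the overall approach first, then go through it step by step:

Step 1: send the request and get the response
Step 2: parse the data and extract the fields we need
Step 3: add a batch number for the insert
Step 4: store the data in the database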
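
The four steps can be sketched as a small pipeline in which each step is passed in as a function. This is only an outline of the flow, not the article's actual code: `run_once` and the stand-in lambdas are hypothetical, chosen so the sketch runs without network or database access.

```python
def run_once(fetch, parse, next_batch, store):
    """Run the four steps once; each step is supplied as a function."""
    raw = fetch()             # step 1: send the request, get the response body
    rows = parse(raw)         # step 2: extract the fields we need
    batch = next_batch()      # step 3: work out the batch number for this run
    for row in rows:          # step 4: store every row under that batch
        store(batch, row)
    return batch, len(rows)

# Stand-in functions so the sketch runs offline
stored = []
result = run_once(
    fetch=lambda: "topic one,topic two",
    parse=lambda raw: raw.split(","),
    next_batch=lambda: 1,
    store=lambda batch, row: stored.append((batch, row)),
)
print(result, stored)
```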

Step 1

Send the request and get the response

Weibo exposes the data through an API, so we can call the endpoint directly. It returns JSON:

# API endpoint: https://weibo.com/ajax/statuses/hot_band

class Spider:
    def __init__(self):
        self.url = "https://weibo.com/ajax/statuses/hot_band"
        self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36"}

    # Send the request and return the response body
    def parse_url(self):
        response = requests.get(self.url, headers=self.headers)
        time.sleep(2)  # pause for two seconds

        return response.content.decode()
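
A single failed request raises an exception inside `parse_url`. One possible hardening, which is not part of the article's code, is to retry a few times before giving up. In the sketch below the helper `fetch_with_retry` and its parameters are hypothetical, and the fetch function is injected so the example runs offline.

```python
import time

def fetch_with_retry(fetch, retries=3, delay=2.0, sleep=time.sleep):
    """Call fetch() until it succeeds, at most `retries` times, sleeping between attempts."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except Exception as e:  # with requests, catching requests.RequestException is narrower
            last_error = e
            if attempt < retries:
                sleep(delay)
    raise last_error

# Offline demonstration: a fake fetch that fails twice, then succeeds
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return '{"data": {"band_list": []}}'

body = fetch_with_retry(flaky_fetch, retries=3, delay=0, sleep=lambda s: None)
print(calls["n"], body)
```

With requests, the injected function could be something like `lambda: requests.get(url, headers=headers, timeout=10).text`, where `url` and `headers` are placeholders for the values set in `__init__`.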

Step 2

Parse the data and extract the fields we need

Pretty-printed, the API response looks like this (we only extract the fields we need): [screenshot]
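
To make the shape concrete, here is a hypothetical, heavily trimmed response body. The key names (`data`, `band_list`, `word`, `category`, `raw_hot`, `rank`, `onboard_time`) are the ones the extraction code reads; the values are invented for illustration and do not come from a real response.

```python
import json

# Hypothetical response body: key names follow the article, values are made up
sample_body = '''
{
  "data": {
    "band_list": [
      {"word": "example topic", "category": "society", "raw_hot": 123456,
       "rank": 0, "onboard_time": 1718163868}
    ]
  }
}
'''

json_data = json.loads(sample_body)
band_list = json_data["data"]["band_list"]  # one dict per hot search entry
first = band_list[0]
print(len(band_list), first["word"], first["rank"] + 1)
```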

# json_data is the parsed API response; a is the current batch number
for ban_list in json_data['data']['band_list'][:50]:
    batch = f'第{a}批'  # batch label stored with every row, e.g. 第1批

    # .get() returns None when a key is missing, so a missing field can neither
    # leave a variable undefined nor reuse the value from the previous entry
    star_word = ban_list.get('star_word')
    title = ban_list.get('word')
    category = ban_list.get('category')
    num = ban_list.get('num')
    subject_querys = ban_list.get('subject_querys')
    flag = ban_list.get('flag')
    icon_desc = ban_list.get('icon_desc')
    raw_hot = ban_list.get('raw_hot')
    mid = ban_list.get('mid')
    emoticon = ban_list.get('emoticon')
    icon_desc_color = ban_list.get('icon_desc_color')
    realpos = ban_list.get('realpos')
    topic_flag = ban_list.get('topic_flag')
    ad_info = ban_list.get('ad_info')
    fun_word = ban_list.get('fun_word')
    note = ban_list.get('note')

    # onboard_time arrives as a Unix timestamp; convert it to a datetime
    onboard_time = ban_list.get('onboard_time')
    if onboard_time is not None:
        onboard_time = datetime.datetime.fromtimestamp(onboard_time)

    # Store the rank shifted up by one
    rank = ban_list.get('rank')
    if rank is not None:
        rank = rank + 1

    # The link is the first href inside the HTML of the entry's mblog text
    links = re.findall('href="(.*?)"', (ban_list.get('mblog') or {}).get('text') or '')
    url = links[0] if links else None
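
One pitfall is worth knowing when optional keys are read inside a loop: if each lookup is wrapped in its own try/except and a key is missing, the variable silently keeps its value from the previous iteration. The sketch below uses two made-up entries to contrast that behaviour with `dict.get()`.

```python
# Two made-up entries; the second one has no "note" key
entries = [{"word": "first", "note": "has note"}, {"word": "second"}]

# try/except per lookup: on a missing key the variable keeps its old value
stale = []
for entry in entries:
    try:
        note = entry["note"]
    except Exception as e:
        print("missing key:", e)
    stale.append(note)

# dict.get() yields None for the missing key instead
clean = [entry.get("note") for entry in entries]

print(stale)
print(clean)
```

In the first list the second entry wrongly inherits the first entry's note, which is exactly the kind of error that is hard to spot once the rows are in the database.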

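Step 3

The batch column records which batch each insert belongs to (50 rows per batch). If the scraper is interrupted, a small method lets it continue from the last batch number.

[screenshot]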

# Return the batch column as a list (used to work out the batch number)
def batch(self):
    conn = pymssql.connect('.', 'sa', 'yuan427', 'test')
    cursor = conn.cursor()

    # Order by id so the last element is the most recently inserted batch
    cursor.execute("select batch from WB_HotList order by id")
    batchlist = [row[0] for row in cursor.fetchall()]

    cursor.close()
    conn.close()
    return batchlist
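
Turning the stored labels into the next batch number can be sketched as a small pure function. `next_batch_number` is a hypothetical helper, not part of the article's code; it assumes labels of the form 第N批 as written by the scraper and takes the highest number rather than the last row, so the order of the query result does not matter.

```python
import re

def next_batch_number(batchlist, default=1):
    """Return the batch number to use next, given the labels already in the table."""
    numbers = []
    for label in batchlist:
        match = re.fullmatch(r'第(\d+)批', label or '')
        if match:
            numbers.append(int(match.group(1)))
    # Highest number seen plus one; fall back to the default for an empty table
    return max(numbers) + 1 if numbers else default

print(next_batch_number([]))
print(next_batch_number(['第1批', '第1批', '第2批']))
```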

Step 4

Store the data in the database

# Connect to the database server and create a cursor
db = pymssql.connect('.', 'sa', 'yuan427', 'test')  # server, user, password, database
if db:
    print("Connected!")
cursor = db.cursor()

try:
    # INSERT statement; daydate is filled in by getdate()
    sql = ("insert into WB_HotList(batch,daydate,star_word,title,category,num,subject_querys,flag,"
           "icon_desc,raw_hot,mid,emoticon,icon_desc_color,realpos,onboard_time,"
           "topic_flag,ad_info,fun_word,note,rank,url) "
           "values (%s,getdate(),%s,%s,%s,%s,%s,%s,%s,%s,%s,"
           "%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)")

    # Run the insert
    cursor.execute(sql, (batch, star_word, title, category, num, subject_querys, flag, icon_desc,
                         raw_hot, mid, emoticon, icon_desc_color, realpos, onboard_time,
                         topic_flag, ad_info, fun_word, note, rank, url))
    db.commit()

    print('Loaded successfully...')

except Exception as e:
    db.rollback()
    print(str(e))

# Close the cursor and disconnect from the database
cursor.close()
db.close()
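
With twenty parameters, it is easy for the column list and the `%s` placeholders to drift out of sync. One way to avoid counting by hand is to generate the statement from a single column list. This is a sketch, not the article's code: `build_insert_sql` is a hypothetical helper, and the table name `WB_HotList` is taken from the CREATE TABLE statement earlier.

```python
# The 20 columns that receive a parameter; daydate is filled in by getdate()
COLUMNS = [
    "batch", "star_word", "title", "category", "num", "subject_querys", "flag",
    "icon_desc", "raw_hot", "mid", "emoticon", "icon_desc_color", "realpos",
    "onboard_time", "topic_flag", "ad_info", "fun_word", "note", "rank", "url",
]

def build_insert_sql(table, columns):
    """Build an INSERT with exactly one %s placeholder per parameter column."""
    # table and columns must be trusted constants: they are pasted into the SQL text
    column_list = ",".join(["daydate"] + list(columns))
    placeholders = ",".join(["getdate()"] + ["%s"] * len(columns))
    return f"insert into {table}({column_list}) values ({placeholders})"

sql = build_insert_sql("WB_HotList", COLUMNS)
print(sql.count("%s"), len(COLUMNS))
```

The row values are still passed separately to `cursor.execute`, so they stay parameterized; only the fixed column names are assembled into the SQL text.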

Complete code

import datetime
import json
import re
import time
from threading import Timer

import pymssql
import requests


class Spider:
    def __init__(self):
        self.url = "https://weibo.com/ajax/statuses/hot_band"
        self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36"}

    # Send the request and return the response body
    def parse_url(self):
        response = requests.get(self.url, headers=self.headers)
        time.sleep(2)

        return response.content.decode()

    # Parse the data and store it in the database
    def parse_data(self, data, a):
        json_data = json.loads(data)

        # Connect to the database server and create a cursor
        db = pymssql.connect('.', 'sa', 'yuan427', 'test')  # server, user, password, database
        cursor = db.cursor()

        # INSERT statement; daydate is filled in by getdate()
        sql = ("insert into WB_HotList(batch,daydate,star_word,title,category,num,subject_querys,flag,"
               "icon_desc,raw_hot,mid,emoticon,icon_desc_color,realpos,onboard_time,"
               "topic_flag,ad_info,fun_word,note,rank,url) "
               "values (%s,getdate(),%s,%s,%s,%s,%s,%s,%s,%s,%s,"
               "%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)")

        for ban_list in json_data['data']['band_list'][:50]:
            batch = f'第{a}批'  # batch label stored with every row, e.g. 第1批

            # .get() returns None when a key is missing, so a missing field can neither
            # leave a variable undefined nor reuse the value from the previous entry
            star_word = ban_list.get('star_word')
            title = ban_list.get('word')
            category = ban_list.get('category')
            num = ban_list.get('num')
            subject_querys = ban_list.get('subject_querys')
            flag = ban_list.get('flag')
            icon_desc = ban_list.get('icon_desc')
            raw_hot = ban_list.get('raw_hot')
            mid = ban_list.get('mid')
            emoticon = ban_list.get('emoticon')
            icon_desc_color = ban_list.get('icon_desc_color')
            realpos = ban_list.get('realpos')
            topic_flag = ban_list.get('topic_flag')
            ad_info = ban_list.get('ad_info')
            fun_word = ban_list.get('fun_word')
            note = ban_list.get('note')

            # onboard_time arrives as a Unix timestamp; convert it to a datetime
            onboard_time = ban_list.get('onboard_time')
            if onboard_time is not None:
                onboard_time = datetime.datetime.fromtimestamp(onboard_time)

            # Store the rank shifted up by one
            rank = ban_list.get('rank')
            if rank is not None:
                rank = rank + 1

            # The link is the first href inside the HTML of the entry's mblog text
            links = re.findall('href="(.*?)"', (ban_list.get('mblog') or {}).get('text') or '')
            url = links[0] if links else None

            try:
                # Run the insert
                cursor.execute(sql, (batch, star_word, title, category, num, subject_querys, flag,
                                     icon_desc, raw_hot, mid, emoticon, icon_desc_color, realpos,
                                     onboard_time, topic_flag, ad_info, fun_word, note, rank, url))
                db.commit()

                print('Loaded successfully...')

            except Exception as e:
                db.rollback()
                print(str(e))

        # Close the cursor and disconnect from the database
        cursor.close()
        db.close()

    # Return the batch column as a list (used to work out the batch number)
    def batch(self):
        conn = pymssql.connect('.', 'sa', 'yuan427', 'test')
        cursor = conn.cursor()

        # Order by id so the last element is the most recently inserted batch
        cursor.execute("select batch from WB_HotList order by id")
        batchlist = [row[0] for row in cursor.fetchall()]

        cursor.close()
        conn.close()
        return batchlist

    # Main logic
    def run(self, a):
        # Derive a from the batch numbers already in the database
        batchlist = self.batch()
        if batchlist:
            a = int(re.findall('第(.*?)批', batchlist[-1])[0]) + 1

        data = self.parse_url()

        self.parse_data(data, a)
        a += 1

        # Schedule the next run
        t = Timer(1800, self.run, (a,))  # 1800 seconds: run every half hour
        t.start()


if __name__ == "__main__":
    spider = Spider()
    spider.run(1)

Running it

The script has to keep running, so I just leave it running in a cmd window.
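
In the complete code, `run` only schedules the next call after the current run finishes, so an unhandled exception (a failed request, for instance) would end the schedule silently. One possible alternative for a long-running script is a plain loop that catches errors and keeps going. The sketch below is not the article's method: `run_periodically` and its parameters are hypothetical, and the sleep function is injected so the demonstration finishes instantly.

```python
import time

def run_periodically(job, interval, max_runs=None, sleep=time.sleep):
    """Call job() every `interval` seconds; a failure in one run does not stop the loop."""
    runs = 0
    failures = 0
    while max_runs is None or runs < max_runs:
        try:
            job()
        except Exception as e:
            failures += 1
            print(f"run failed: {e}")
        runs += 1
        if max_runs is None or runs < max_runs:
            sleep(interval)
    return runs, failures

# Offline demonstration with a job that fails on its second call
log = []

def demo_job():
    log.append(len(log) + 1)
    if len(log) == 2:
        raise RuntimeError("simulated network error")

summary = run_periodically(demo_job, interval=1800, max_runs=3, sleep=lambda s: None)
print(summary, log)
```

Leaving `max_runs` as `None` makes the loop run until the process is stopped, which matches leaving the script open in a cmd window.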

Once it is running, check the database:

[screenshot]

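Summary

In short, scraping the Weibo hot search list with Python is one way to collect valuable information. In practice, choose the scraping approach that fits your situation, and comply with the applicable laws and regulations and with the site's terms of use. I hope this article helps you understand and use Python to scrape the Weibo hot search list.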
