An Introduction to BeautifulSoup4 and How to Use It


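Python environment
Python 3.7.1

Introduction to BeautifulSoup

Beautiful Soup is a Python library for extracting data from HTML and XML files.
It works with a parser of your choice to provide ways of navigating, searching, and modifying the document.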

Installing BeautifulSoup4

Installation

On a recent version of Ubuntu, you can install it with the system package manager:

$ apt-get install python3-bs4

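If you cannot install it through the system package manager, you can use easy_install or pip instead. The package is named beautifulsoup4 and, in the releases this article was written against, works on both Python 2 and Python 3.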

$ easy_install beautifulsoup4
$ pip install beautifulsoup4

If neither easy_install nor pip is available, you can download the BS4 source, unpack it, change into the beautifulsoup directory, and install it with setup.py. (The procedure is the same on Windows.)

$ python setup.py install

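Common problems

If the code raises an exception at this point, you may be running the Python 2 version of the code under Python 3, or the Python 3 version under Python 2. The most reliable fix is to reinstall BeautifulSoup4.

Suppose you need to convert the BS4 code from Python 2 to Python 3. You can reinstall BS4: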

$ python3 setup.py install

Or run the Python code conversion script in the bs4 directory:

$ 2to3-3.2 -w bs4

Installing a parser

BeautifulSoup supports the HTML parser in the Python standard library out of the box.
To have BeautifulSoup use the html5lib parser, install it as follows:

$ pip install html5lib

To have BeautifulSoup use the lxml parser, install it as follows:

$ pip install lxml
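Whichever parsers are installed, it helps to name the one you want as the second argument to the BeautifulSoup constructor, so that the same code builds the same tree on every machine. A minimal sketch, assuming only the standard-library parser is available (the markup string is a made-up example):

```python
from bs4 import BeautifulSoup

markup = "<p>hello <b>python</b></p>"  # hypothetical sample markup

# "html.parser" ships with Python, so no extra install is needed.
soup = BeautifulSoup(markup, "html.parser")
print(soup.p.b.string)

# Other installed parsers are selected the same way, e.g. "lxml" or "html5lib".
# Different parsers may build different trees from the same invalid markup.
```

The rest of this article shows output produced by the lxml parser, which wraps fragments in html and body tags.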

Using BeautifulSoup4

Usage

from bs4 import BeautifulSoup                              # import the BeautifulSoup4 library
# Name the parser explicitly; without it bs4 guesses one and emits a warning
soup = BeautifulSoup("<html>hello python</html>", "lxml")  # build the document object
print(soup)

'''
result:
<html><body><p>hello python</p></body></html>
'''

Kinds of objects

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. Every node is one of four kinds of object: Tag, NavigableString, BeautifulSoup, and Comment.
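As a quick orientation before the individual sections, the four kinds can be seen side by side in a tiny document. This is only a sketch: the one-line markup is invented for illustration, and the standard-library parser is used so that nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical one-line document containing a tag, some text and a comment.
soup = BeautifulSoup("<p>text<!-- note --></p>", "html.parser")

p = soup.p
text, note = p.contents  # the <p> tag has exactly two children

kinds = [type(soup).__name__, type(p).__name__,
         type(text).__name__, type(note).__name__]
print(kinds)

# BeautifulSoup is a subclass of Tag, and Comment is a subclass of
# NavigableString, which is why most tag and string methods work on them too.
print(isinstance(soup, Tag), isinstance(note, NavigableString))
```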

Tag

from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="www.baidu.com">baidu</a>', 'lxml')

tag = soup.a

print(tag)
print(type(tag))
'''
result:
<a href="www.baidu.com">baidu</a>
<class 'bs4.element.Tag'>
'''

print('tag.name:',tag.name)
tag.name = 'b'
print(tag)
'''
result:
tag.name: a
<b href="www.baidu.com">baidu</b>
'''

print(tag.attrs)
print(tag['href'])
tag['href'] = 'www.163.com'
print(tag['href'])

# Delete an attribute with del tag['href'].
# A bare "del tag" only unbinds the Python variable, so a following
# print(tag) fails with NameError: name 'tag' is not defined.
del tag['href']
print(tag)
'''
result:
{'href': 'www.baidu.com'}
www.baidu.com
www.163.com
<b>baidu</b>
'''

# Multi-valued attributes can be read and modified as well
soup = BeautifulSoup('<p class="t1 t2"></p>', 'lxml')
print(soup.p['class'])
soup.p['class'] = ['t3','s1']
print(soup.p['class'])
'''
result:
['t1', 't2']
['t3', 's1']
'''


NavigableString

Wraps the text inside a tag

soup = BeautifulSoup('<p class="t1">testong</p>', 'lxml')
tag = soup.p
print(tag.string)
'''
result:
testong 
'''

# Replace the string in place
print(tag.string)
tag.string.replace_with(" one two three")
print(tag.string)
'''
result:
testong
 one two three
'''


BeautifulSoup

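The BeautifulSoup object represents the document as a whole. It does not correspond to a real tag, so its .name attribute has the special value '[document]'

soup = BeautifulSoup('<p class="t1">testong</p>', 'lxml')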
print(soup.name)
'''
result:
[document]
'''


Comment

A Comment object is a special kind of NavigableString that represents a comment in the document

soup = BeautifulSoup('<p class="t1"><!-- when where who --></p>', 'lxml')
print(soup.p.string)
print('string type ',type(soup.p.string))
print(soup.p.prettify())
'''
result:
 when where who 
string type  <class 'bs4.element.Comment'>
<p class="t1"> 
<!-- when where who --> 
</p>  
'''
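Because .string hands back a Comment when a comment is the tag's only child, code that collects visible text may want to check the type first. A small sketch of that check, using made-up markup and the standard-library parser:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup: one paragraph holds only a comment, the other real text.
soup = BeautifulSoup("<p><!-- hidden --></p><p>visible</p>", "html.parser")

visible = []
for p in soup.find_all("p"):
    s = p.string
    # Skip tags whose only child is a comment.
    if s is not None and not isinstance(s, Comment):
        visible.append(str(s))

print(visible)
```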


Navigating the tree

Example document:

from bs4 import BeautifulSoup

soup = BeautifulSoup('''
<!DOCTYPE HTML>
<html lang="zh-CN">
 <head itemprop="video" itemscope itemtype="//schema.org/VideoObject">
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>王牌对王牌4之姚晨沙溢再聚同福客栈</title>
 </head>
 <body>
<div is="i71-header" page-name="" class="qy-header" id="block-A" v-bind:non-index='true'>
    <!--@<template slot="header" slot-scope="props">@-->
    <div id="nav_logo" class="qy-logo" style="display: none;" :style="{ display: 'block'}">
            <i class="logo-dot"></i>
            <a class="logo-channel" title="综艺" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6"><h2>综艺</h2></a>
            </div>
            <div class="qy-player-head-list" is="i71-playpage-source-video-floater" style="display:none;"></div>
</div>           
</body></html> 
''', 'lxml')


Child nodes

Tag name
# A tag can be reached by using its name as an attribute, e.g. soup.head
print(soup.head)
print()
print(soup.div)

'''
result:
<head itemprop="video" itemscope="" itemtype="//schema.org/VideoObject">
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>王牌对王牌4之姚晨沙溢再聚同福客栈</title>                                                                  
</head>  
<div class="qy-header" id="block-A" is="i71-header" page-name="" v-bind:non-index="true"> 
<!--@<template slot="header" slot-scope="props">@-->  
<div :style="{ display: 'block'}" class="qy-logo" id="nav_logo" style="display: none;"> 
<i class="logo-dot"></i>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
</div>
<div class="qy-player-head-list" is="i71-playpage-source-video-floater" style="display:none;"></div>
</div> 
'''

# Use find_all() to get every tag with the given name
print(soup.find_all('div'))

'''
result:
[<div class="qy-header" id="block-A" is="i71-header" page-name="" v-bind:non-index="true">
<!--@<template slot="header" slot-scope="props">@--> 
<div :style="{ display: 'block'}" class="qy-logo" id="nav_logo" style="display: none;">
<i class="logo-dot"></i> 
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a> 
</div>  
<div class="qy-player-head-list" is="i71-playpage-source-video-floater" style="display:none;"></div>
</div>, <div :style="{ display: 'block'}" class="qy-logo" id="nav_logo" style="display: none;"> 
<i class="logo-dot"></i>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
</div>, <div class="qy-player-head-list" is="i71-playpage-source-video-floater" style="display:none;"></div>] 
'''

.contents and .children
.contents

A tag's .contents attribute returns its child nodes as a list

tag = soup.head
print(tag)
print()
print(tag.contents)

'''
result:
<head itemprop="video" itemscope="" itemtype="//schema.org/VideoObject">
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>王牌对王牌4之姚晨沙溢再聚同福客栈</title>
</head>
['\n', <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>, '\n', <title>王牌对王牌4之姚晨沙溢再聚同福客栈</title>, '\n']
'''

.children

A tag's .children attribute is an iterator over its direct children


for t in tag.children:
	print(t)
'''
result:
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>

<title>王牌对王牌4之姚晨沙溢再聚同福客栈</title>                                                                                                                         
'''
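The blank lines in the output above are whitespace text nodes between the tags. When only the element children matter, one option is to filter by type. A minimal sketch on an invented list (not the article's sample page), using the standard-library parser:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<ul>\n<li>one</li>\n<li>two</li>\n</ul>"  # hypothetical sample markup
soup = BeautifulSoup(html, "html.parser")

ul = soup.ul
print(len(ul.contents))  # counts the newline text nodes as well

# Keep only the children that are tags, dropping the '\n' text nodes.
items = [child for child in ul.children if isinstance(child, Tag)]
print([str(li.string) for li in items])
```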

.descendants

.children and .contents only cover a tag's direct children. .descendants iterates recursively over all of its descendants: its children, their children, and so on


for t in tag.descendants:
	print(t)
'''
result:
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>

<title>王牌对王牌4之姚晨沙溢再聚同福客栈</title>
王牌对王牌4之姚晨沙溢再聚同福客栈
'''
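The difference between the two is easiest to see by counting. The sketch below uses a made-up fragment and the standard-library parser; the text nodes inside nested tags show up in .descendants but not in .children.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment: <div> has one child, <p>, which has nested content.
soup = BeautifulSoup("<div><p>one <b>two</b></p></div>", "html.parser")
div = soup.div

children = list(div.children)        # only <p>
descendants = list(div.descendants)  # <p>, 'one ', <b>, 'two'
print(len(children), len(descendants))

# Tags only, in document order.
tag_names = [node.name for node in descendants if isinstance(node, Tag)]
print(tag_names)
```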

.string

If a tag has exactly one child and that child is a NavigableString, .string returns it. If the tag has more than one child, .string is None


tag = soup.head
print(tag.string)

title_tag = tag.title
print(title_tag.string)
'''
result:
None
王牌对王牌4之姚晨沙溢再聚同福客栈
'''

.strings

If a tag contains more than one string, iterate over them with .strings


for s in soup.strings:
    print(repr(s))
'''
result:
'\n'
'\n'
'\n'
'王牌对王牌4之姚晨沙溢再聚同福客栈'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'综艺'
'\n'
'\n'
'\n'
'\n'
'\n'
'''

.stripped_strings

Use .stripped_strings to skip whitespace-only strings and strip leading and trailing whitespace from the rest


for s in soup.stripped_strings:
    print(repr(s))
'''
result:
'王牌对王牌4之姚晨沙溢再聚同福客栈'
'综艺'
'''


Parent nodes

.parent

Use the .parent attribute to get an element's parent node


tag = soup.title
print(tag.parent)
'''
<head itemprop="video" itemscope="" itemtype="//schema.org/VideoObject">
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>王牌对王牌4之姚晨沙溢再聚同福客栈</title>
</head>
'''

.parents

Use the .parents attribute to iterate over all of an element's ancestors, from its parent up to the BeautifulSoup object


tag = soup.title

for p in tag.parents:
	if p is None:
		print(p)
	else:
		print(p.name)
'''
head
html
[document]
'''


Sibling nodes

.next_sibling and .previous_sibling

Use the .next_sibling and .previous_sibling attributes to move between nodes on the same level of the tree

# Using .previous_sibling
tag = soup.a
previous_tag = tag.previous_sibling

print(previous_tag)
print(previous_tag.previous_sibling)
'''
result:
				(a blank line is printed here: the whitespace between tags is a node too)
<i class="logo-dot"></i>
'''

# Using .next_sibling
tag = soup.i
next_tag = tag.next_sibling

print(next_tag)
print(next_tag.next_sibling)
'''
result:

<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
'''
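As the output shows, the immediate sibling of a tag is often a whitespace string rather than the neighbouring tag. One way around this is find_next_sibling(), covered later in this article, which skips to the next matching tag. A small sketch on invented markup, using the standard-library parser:

```python
from bs4 import BeautifulSoup

html = "<div>\n<i>dot</i>\n<a>link</a>\n</div>"  # hypothetical sample markup
soup = BeautifulSoup(html, "html.parser")

i_tag = soup.i
print(repr(i_tag.next_sibling))            # the newline between the two tags
print(i_tag.find_next_sibling("a").name)   # the next sibling that is an <a> tag
```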

.next_siblings and .previous_siblings

Use the .next_siblings and .previous_siblings attributes to iterate over all of a node's siblings

# Using .previous_siblings
tag = soup.a

for previous in tag.previous_siblings:
	print(repr(previous))
'''
result:
'\n'
<i class="logo-dot"></i>
'\n'
'''

# Using .next_siblings
tag = soup.i

for sibling in tag.next_siblings:
    print(repr(sibling))
'''
result:
'\n'
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
'\n'
'''


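Going forward and back

.next_element and .previous_element

The .next_element and .previous_element attributes point to the object that was parsed immediately after or immediately before the current one

tag = soup.a
# .previous_element
print(tag.previous_element)
print(tag.previous_element.previous_element)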
'''
result:
							(a blank line is printed here: the object parsed just before this tag is a whitespace string)
<i class="logo-dot"></i>
'''

#.next_element
print(tag.next_element)
'''
result:
<h2>综艺</h2>
'''

.next_elements and .previous_elements

The .next_elements and .previous_elements attributes are iterators that move forward or backward through the document in the order it was parsed

# .previous_elements
tag = soup.head
for e in tag.previous_elements:
	print(e)
'''
result:

<html lang="zh-CN">
<head itemprop="video" itemscope="" itemtype="//schema.org/VideoObject">
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>王牌对王牌4之姚晨沙溢再聚同福客栈</title>
</head>
<body>
<div class="qy-header" id="block-A" is="i71-header" page-name="" v-bind:non-index="true">
<!--@<template slot="header" slot-scope="props">@-->
<div :style="{ display: 'block'}" class="qy-logo" id="nav_logo" style="display: none;">
<i class="logo-dot"></i>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
</div>
<div class="qy-player-head-list" is="i71-playpage-source-video-floater" style="display:none;"></div>
</div>
</body></html>
HTML
'''

# .next_elements
tag = soup.h2
for e in tag.next_elements:
	print(e)
'''
result:
综艺




<div class="qy-player-head-list" is="i71-playpage-source-video-floater" style="display:none;"></div>






'''


Searching the tree

Example document:

from bs4 import BeautifulSoup

soup = BeautifulSoup('''
<!DOCTYPE HTML>
<html lang="zh-CN">
 <head itemprop="video" itemscope itemtype="//schema.org/VideoObject">
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>王牌对王牌4之姚晨沙溢再聚同福客栈</title>
 </head>
 <body>
<div is="i71-header" page-name="" class="qy-header" id="block-A" v-bind:non-index='true'>
    <!--@<template slot="header" slot-scope="props">@-->
    <div id="nav_logo" class="qy-logo" style="display: none;" :style="{ display: 'block'}">
            <i class="logo-dot"></i>
            <a class="logo-channel" title="综艺" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6"><h2>综艺</h2></a>
            </div>
            <div class="qy-player-head-list" is="i71-playpage-source-video-floater" style="display:none;"></div>
</div>           
</body></html> 
''', 'lxml')


find_all()

find_all(name, attrs, recursive, string, limit, **kwargs)


# name argument
# Find every tag whose name matches name
print(soup.find_all("a"))
'''
result:
[<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>]
'''

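# keyword arguments
# A keyword argument that find_all() does not recognize is treated as a filter on the attribute of that name
import re

print(soup.find_all(id='nav_logo'))
print(soup.find_all(href=re.compile("zongyi/")))
# Some attribute names cannot be written as keyword arguments, but they can be passed in a dictionary through attrs
# print(soup.find_all(class="qy-logo"))  raises SyntaxError: invalid syntax
print(soup.find_all(attrs={"class": "qy-logo"}))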
'''
result:
[<div :style="{ display: 'block'}" class="qy-logo" id="nav_logo" style="display: none;">
<i class="logo-dot"></i>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
</div>]
[<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>]

[<div :style="{ display: 'block'}" class="qy-logo" id="nav_logo" style="display: none;">
<i class="logo-dot"></i>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
</div>]
'''

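# Searching by CSS class
# class is a reserved word in Python, so using it as a keyword argument is a syntax error. Newer versions of Beautiful Soup (4.1.2 and later, according to the official documentation) accept class_ instead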
print(soup.find_all('i',class_='logo-dot'))
'''
result:
[<i class="logo-dot"></i>]
'''

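# string argument
# The string argument searches the text of the document instead of the tags; like name, it accepts a string, a regular expression, a list, and so on
# Older code calls this argument text; string is the name used in the signature above
print(soup.find_all(string="综艺"))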
'''
result:
['综艺']
'''

# limit argument
# Use limit to cap the number of results returned
print(soup.find_all("div",limit=1))
'''
result:
[<div class="qy-header" id="block-A" is="i71-header" page-name="" v-bind:non-index="true">
<!--@<template slot="header" slot-scope="props">@-->
<div :style="{ display: 'block'}" class="qy-logo" id="nav_logo" style="display: none;">
<i class="logo-dot"></i>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
</div>
<div class="qy-player-head-list" is="i71-playpage-source-video-floater" style="display:none;"></div>
</div>]
'''

# recursive argument
# By default find_all() examines all descendants of the current tag. Pass recursive=False to examine only its direct children
print(soup.find_all("div",id='nav_logo',recursive=True))
print(soup.find_all("div",id='nav_logo',recursive=False))
'''
result:
[<div :style="{ display: 'block'}" class="qy-logo" id="nav_logo" style="display: none;">
<i class="logo-dot"></i>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
</div>]
[]
'''
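The arguments above can also be combined in a single call. The sketch below is an illustration on invented markup (the data-id attribute and class names are made up), parsed with the standard-library parser; it shows class_ matching one class out of several, attrs for an attribute name that is not a valid Python identifier, and limit.

```python
from bs4 import BeautifulSoup

html = '''
<div class="card big" data-id="1">first</div>
<div class="card" data-id="2">second</div>
<span class="card" data-id="3">third</span>
'''  # hypothetical sample markup
soup = BeautifulSoup(html, "html.parser")

# class_ matches a tag if any one of its classes equals the value.
cards = soup.find_all("div", class_="card")
print(len(cards))

# data-id is not a valid Python identifier, so it goes through attrs.
second = soup.find_all(attrs={"data-id": "2"})
print(second[0].string)

# A name filter, attrs and limit can be combined in one call.
first_card = soup.find_all(True, attrs={"class": "card"}, limit=1)
print(first_card[0]["data-id"])
```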


find()

If you only want a single result, use find()

print(soup.find("title"))
'''
result:
<title>王牌对王牌4之姚晨沙溢再聚同福客栈</title>
'''
# soup.find("title") is like soup.find_all('title', limit=1), except that find() returns the tag itself (or None when nothing matches) instead of a list
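The different return values matter most when nothing matches. A short sketch of the usual guard, using a made-up one-tag document and the standard-library parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<title>demo</title>", "html.parser")  # hypothetical markup

print(soup.find_all("h1"))   # an empty list when nothing matches
missing = soup.find("h1")    # None when nothing matches
print(missing)

# Guard before touching attributes; None has no .string attribute.
heading = missing.string if missing is not None else "(no heading)"
print(heading)
```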

Filters

String

Pass a string to find_all() to match tags with exactly that name

print(soup.find_all('a'))
'''
result:
[<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>]
'''

Regular expression

Pass a compiled regular expression to find_all() to match tag names against it

import re

for tag in soup.find_all(re.compile("^b")):
	print(tag.name)
'''
result:
body
'''
	
List

Pass a list to find_all() to match any tag whose name is in the list


print(soup.find_all(["i","a"]))
'''
result:
[<i class="logo-dot"></i>, <a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>]
'''

True

True matches every tag in the document, but none of the text strings


for tag in soup.find_all(True):
	print(tag.name)
'''
result:
html
head
meta
title
body
div
div
i
a
h2
div
'''

Function

Pass a function to find_all(). It is called with each tag and should return True for the tags to keep


def method1(tag):
	return tag.has_attr('class') and not tag.has_attr('id')
	
print(soup.find_all(method1))
'''
result:
[<i class="logo-dot"></i>, <a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>, <div class="qy-player-head-list" is="i71-playpage-source-video-floater" style="display:none;"></div>]
'''


find_parents() and find_parent()

Search the ancestors of the current node

a_string = soup.find(string="综艺")
print(a_string)
print(a_string.find_parents("a"))
print(a_string.find_parent("a"))
'''
result:
综艺
[<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>]
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
'''

find_next_siblings() and find_next_sibling()

Search the siblings that follow the current node. find_next_siblings() returns every matching sibling, while find_next_sibling() returns only the first one

print(soup.i.find_next_siblings("a"))
print(soup.i.find_next_sibling("a"))
'''
result:
[<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>]
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
'''


find_all_next() and find_next()

Search the nodes that come after the current node in the document

print(soup.i.find_all_next())
print(soup.i.find_next())
'''
result:
[<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>, <h2>综艺</h2>, <div class="qy-player-head-list" is="i71-playpage-source-video-floater" style="display:none;"></div>]
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
'''


find_all_previous() and find_previous()

Search the nodes that come before the current node in the document

print(soup.title.find_all_previous())
print(soup.title.find_previous())
'''
result:
[<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>, <head itemprop="video" itemscope="" itemtype="//schema.org/VideoObject">
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>王牌对王牌4之姚晨沙溢再聚同福客栈</title>
</head>, <html lang="zh-CN">
<head itemprop="video" itemscope="" itemtype="//schema.org/VideoObject">
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>王牌对王牌4之姚晨沙溢再聚同福客栈</title>
</head>
<body>
<div class="qy-header" id="block-A" is="i71-header" page-name="" v-bind:non-index="true">
<!--@<template slot="header" slot-scope="props">@-->
<div :style="{ display: 'block'}" class="qy-logo" id="nav_logo" style="display: none;">
<i class="logo-dot"></i>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
</div>
<div class="qy-player-head-list" is="i71-playpage-source-video-floater" style="display:none;"></div>
</div>
</body></html>]
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
'''


CSS selectors

Pass a selector string to the .select() method to search with it

# Select by tag name
print(soup.select('a'))
'''
result:
[<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>]
'''

# Select by id
print(soup.select('#nav_logo'))
'''
result:
[<div :style="{ display: 'block'}" class="qy-logo" id="nav_logo" style="display: none;">
<i class="logo-dot"></i>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
</div>]
'''

# Select by class
print(soup.select('.qy-logo'))
'''
result:
[<div :style="{ display: 'block'}" class="qy-logo" id="nav_logo" style="display: none;">
<i class="logo-dot"></i>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
</div>]
'''

# Select by attribute value
print(soup.select('div[style="display:none;"]'))
'''
result:
[<div class="qy-player-head-list" is="i71-playpage-source-video-floater" style="display:none;"></div>]
'''
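Selectors can also be combined, and select_one() returns a single element instead of a list. The following is a sketch on invented markup (the id, class and href values are made up), parsed with the standard-library parser:

```python
from bs4 import BeautifulSoup

html = '''
<div id="nav"><a class="link" href="/a">A</a><p><a href="/b">B</a></p></div>
'''  # hypothetical sample markup
soup = BeautifulSoup(html, "html.parser")

# Descendant combinator: every <a> anywhere inside #nav.
all_links = [a["href"] for a in soup.select("#nav a")]
print(all_links)

# Child combinator: only <a> tags that are direct children of #nav.
direct_links = [a["href"] for a in soup.select("#nav > a")]
print(direct_links)

# select_one() returns the first match, or None if there is none.
first = soup.select_one("a.link")
print(first.string)
print(soup.select_one("a.missing"))
```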


Modifying the tree

Example document:

from bs4 import BeautifulSoup

soup = BeautifulSoup('''
<!DOCTYPE HTML>
<html lang="zh-CN">
 <head itemprop="video" itemscope itemtype="//schema.org/VideoObject">
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>王牌对王牌4之姚晨沙溢再聚同福客栈</title>
 </head>
 <body>
<div is="i71-header" page-name="" class="qy-header" id="block-A" v-bind:non-index='true'>
    <!--@<template slot="header" slot-scope="props">@-->
    <div id="nav_logo" class="qy-logo" style="display: none;" :style="{ display: 'block'}">
            <i class="logo-dot"></i>
            <a class="logo-channel" title="综艺" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6"><h2>综艺</h2></a>
            </div>
            <div class="qy-player-head-list" is="i71-playpage-source-video-floater" style="display:none;"></div>
</div>           
</body></html> 
''', 'lxml')


Changing a tag's name and attributes

tag = soup.i
print(tag)
tag.name = "a"
print(tag)
tag['class']='logo'
print(tag)
del tag['class']
print(tag)
'''
result:
<i class="logo-dot"></i>
<a class="logo-dot"></a>
<a class="logo"></a>
<a></a>
'''


Changing .string

tag = soup.h2
print(tag)
tag.string = "zongyi"
print(tag)
'''
result:
<h2>综艺</h2>
<h2>zongyi</h2>
'''


append()

Adds content to the end of a tag's contents, much like list.append() in Python

tag = soup.h2
print(tag)
tag.append(" hhhh ")
print(tag)
'''
result:
<h2>综艺</h2>
<h2>综艺 hhhh </h2>
'''


BeautifulSoup.new_string() and .new_tag()

# new_string() is a method of the BeautifulSoup object, not of a tag
s1 = BeautifulSoup("<b></b>", "lxml")
tag = s1.b
print(tag)
tag.append(s1.new_string(" test "))
print(tag)
'''
result:
<b></b>
<b> test </b>
'''

# Adding a comment
from bs4 import Comment

s1 = BeautifulSoup("<b></b>", "lxml")
tag = s1.b
print(tag)
comment = s1.new_string("1 2 3", Comment)
tag.append(comment)
print(tag)
'''
result:
<b></b>
<b><!--1 2 3--></b>
'''

# Adding a new tag
s1 = BeautifulSoup("<b></b>", "lxml")
tag = s1.b
print(tag)
new_tag = s1.new_tag("a", href="http://www.baidu.com")
tag.append(new_tag)
print(tag)
'''
result:
<b></b>
<b><a href="http://www.baidu.com"></a></b>
'''


Inserting

# insert()
tag = soup.a
tag.insert(0," hello ")
print(tag)
tag.insert(2," world ")
print(tag)
'''
result:
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"> hello <h2>综艺</h2></a>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"> hello <h2>综艺</h2> world </a>
'''

# insert_before()
tag = soup.a
tag1 = soup.i
tag1.string = "hello"
tag.string.insert_before(tag1)
print(tag)
'''
result:
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2><i class="logo-dot">hello</i>综艺</h2></a>
'''

# insert_after()
tag = soup.a
tag1 = soup.i
tag1.string = "hello"
tag.string.insert_after(tag1)
print(tag)
'''
result:
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺<i class="logo-dot">hello</i></h2></a>
'''


clear()

Removes the contents of the current node

tag = soup.a
print(tag)
tag.clear()
print(tag)
'''
result:
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"></a>
'''


extract()

Removes the current node from the tree and returns it

tag = soup.a
print(tag)
h_tag = tag.h2.extract()
print(tag)
print(h_tag)
'''
result:
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"></a>
<h2>综艺</h2>
'''


decompose()

Removes the current node from the tree, then completely destroys it and its contents

tag = soup.a
print(tag)
tag.h2.decompose()
print(tag)
'''
result:
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"></a>
'''
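The practical difference between extract() and decompose() is whether the removed node can be reused. A minimal sketch on made-up markup, using the standard-library parser: the extracted heading is moved into another tag, while the decomposed paragraph is simply gone.

```python
from bs4 import BeautifulSoup

html = "<div><h2>title</h2><p>body</p></div><section></section>"  # hypothetical markup
soup = BeautifulSoup(html, "html.parser")

# extract() hands the removed node back, so it can be placed elsewhere.
h2 = soup.h2.extract()
soup.section.append(h2)

# decompose() returns nothing; the node and its contents are discarded.
result = soup.p.decompose()

print(result)
print(soup)
```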


replace_with()

Replaces part of the tree with a new tag or text node

tag = soup.a
print(tag)
new_tag = soup.new_tag("b")
new_tag.string = "test"
tag.h2.replace_with(new_tag)
print(tag)
'''
result:
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><b>test</b></a>
'''


wrap() and unwrap()

wrap() wraps an element in the given tag. unwrap() does the opposite: it removes a tag but keeps its contents

# wrap()
soup = BeautifulSoup("<p>I wish I was bold.</p>", "lxml")
print(soup)
soup.p.string.wrap(soup.new_tag("b"))
print(soup)
'''
result:
<html><body><p>I wish I was bold.</p></body></html>
<html><body><p><b>I wish I was bold.</b></p></body></html>
'''

# unwrap()
soup = BeautifulSoup("<p>I wish I was bold.</p>", "lxml")
print(soup)
soup.p.string.wrap(soup.new_tag("b"))
print(soup)
soup.b.unwrap()
print(soup)
'''
result:
<html><body><p>I wish I was bold.</p></body></html>
<html><body><p><b>I wish I was bold.</b></p></body></html>
<html><body><p>I wish I was bold.</p></body></html>
'''


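Conclusion

These are my notes from working through the BeautifulSoup4 documentation. Some parts may not be detailed enough, but I hope they are still useful to other beginners. To learn more, see the Beautiful Soup Documentation.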
