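Beautiful Soup is a Python library for extracting data from HTML and XML files.
It works with a parser of your choice to provide ways of navigating, searching, and modifying the document.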
If you are using a recent version of Ubuntu, you can install it with the system package manager:
$ apt-get install python3-bs4
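If you cannot install it with the system package manager, you can use easy_install or pip instead. The package is named beautifulsoup4. Releases up to 4.9.3 support both Python 2 and Python 3; later releases require Python 3.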
$ easy_install beautifulsoup4
$ pip install beautifulsoup4
If you have neither easy_install nor pip, you can download the BS4 source, unpack it, change into the beautifulsoup directory, and install with setup.py. (The procedure is the same on Windows.)
$ python setup.py install
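If the code raises an exception at this point, you may be running the Python 3 version of the code under Python 2, or the Python 2 version under Python 3. The best fix is to reinstall Beautiful Soup 4.
Suppose you need to convert the BS4 code from Python 2 to Python 3. You can reinstall BS4:
$ python3 setup.py install
or run the Python code conversion script in the bs4 directory:
$ 2to3-3.2 -w bs4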
Beautiful Soup supports the HTML parser in the Python standard library out of the box.
To use the html5lib parser with Beautiful Soup, install it like this:
$ pip install html5lib
To use the lxml parser with Beautiful Soup, install it like this:
$ pip install lxml
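from bs4 import BeautifulSoup  # import the Beautiful Soup 4 library

# Parse the markup into a document object. The parser is named explicitly
# because the result below is what the lxml parser produces.
soup = BeautifulSoup("<html>hello python</html>", "lxml")
print(soup)
'''
result:
<html><body><p>hello python</p></body></html>
'''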
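The parsers installed above are selected by passing the parser name as the second argument to BeautifulSoup(). If the argument is omitted, Beautiful Soup picks the best parser it finds installed and emits a warning, so the same script can give different trees on different machines. The following is a minimal sketch that uses the standard-library parser; the markup string is a made-up example, and "lxml" or "html5lib" could be named in the same position if those packages are installed.

```python
from bs4 import BeautifulSoup

# Hypothetical input: deliberately incomplete HTML with an unclosed <p>.
markup = "<p>hello python"

# Name the parser explicitly so the result does not depend on which
# optional parsers happen to be installed.
soup = BeautifulSoup(markup, "html.parser")

# html.parser closes the open <p> but does not add <html> or <body>.
print(soup)
```

Compared with the lxml result shown earlier, html.parser leaves the document without an html/body wrapper, which is one reason to fix the parser choice in code.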
Beautiful Soup converts a complex HTML document into a tree of Python objects. Every node in the tree is one of four kinds of object: Tag, NavigableString, BeautifulSoup, and Comment.
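As a quick illustration of the four object kinds, the sketch below parses a one-line document (an invented example) and inspects the type of each node. It also shows how the classes relate: a Comment is a special NavigableString, and the BeautifulSoup object behaves like a Tag.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical one-line document containing a tag, a string and a comment.
soup = BeautifulSoup("<p>text<!-- note --></p>", "html.parser")

tag = soup.p
text, comment = tag.contents  # the <p> has exactly two children

kinds = [type(node).__name__ for node in (soup, tag, text, comment)]
print(kinds)

# Class relationships between the four kinds.
print(isinstance(comment, NavigableString))
print(isinstance(soup, Tag))
```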
A Tag object corresponds to a tag in the original document. Its name and attributes can be read and modified:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="www.baidu.com">baidu</a>')
tag = soup.a
print(tag)
print(type(tag))
'''
result:
<a href="www.baidu.com">baidu</a>
<class 'bs4.element.Tag'>
'''

# Every tag has a name, which can be read and changed
print('tag.name:', tag.name)
tag.name = 'b'
print(tag)
'''
result:
tag.name: a
<b href="www.baidu.com">baidu</b>
'''

# Attributes behave like a dictionary: read, assign, and delete by key
print(tag.attrs)
print(tag['href'])
tag['href'] = 'www.163.com'
print(tag['href'])
del tag['href']
print(tag)
'''
result:
{'href': 'www.baidu.com'}
www.baidu.com
www.163.com
<b>baidu</b>
'''

# Multi-valued attributes such as class are handled as lists
soup = BeautifulSoup('<p class="t1 t2"></p>')
print(soup.p['class'])
soup.p['class'] = ['t3', 's1']
print(soup.p['class'])
'''
result:
['t1', 't2']
['t3', 's1']
'''
A NavigableString wraps the text inside a tag.
soup = BeautifulSoup('<p class="t1">testong</p>')
tag = soup.p
print(tag.string)
'''
result:
testong
'''

# A string cannot be edited in place, but replace_with() swaps it for another
print(tag.string)
tag.string.replace_with(" one two three")
print(tag.string)
'''
result:
testong
 one two three
'''
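The BeautifulSoup object represents the document as a whole. It does not correspond to a real tag, so its .name attribute has the special value '[document]'.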
soup = BeautifulSoup('<p class="t1">testong</p>')
print(soup.name)
'''
result:
[document]
'''
The Comment object is a special kind of NavigableString that represents a comment in the document.
soup = BeautifulSoup('<p class="t1"><!-- when where who --></p>')
print(soup.p.string)
print('string type ',type(soup.p.string))
print(soup.p.prettify())
'''
result:
when where who
string type <class 'bs4.element.Comment'>
<p class="t1">
<!-- when where who -->
</p>
'''
Example:
from bs4 import BeautifulSoup

soup = BeautifulSoup('''
<!DOCTYPE HTML>
<html lang="zh-CN">
<head itemprop="video" itemscope itemtype="//schema.org/VideoObject">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>王牌对王牌4之姚晨沙溢再聚同福客栈</title>
</head>
<body>
<div is="i71-header" page-name="" class="qy-header" id="block-A" v-bind:non-index='true'>
<!--@<template slot="header" slot-scope="props">@-->
<div id="nav_logo" class="qy-logo" style="display: none;" :style="{ display: 'block'}">
<i class="logo-dot"></i>
<a class="logo-channel" title="综艺" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6"><h2>综艺</h2></a>
</div>
<div class="qy-player-head-list" is="i71-playpage-source-video-floater" style="display:none;"></div>
</div>
</body></html>
''')
# Using a tag name as an attribute of the soup returns the first tag with that name
print(soup.head)
print()
print(soup.div)
'''
result:
<head itemprop="video" itemscope="" itemtype="//schema.org/VideoObject">
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>王牌对王牌4之姚晨沙溢再聚同福客栈</title>
</head>

<div class="qy-header" id="block-A" is="i71-header" page-name="" v-bind:non-index="true">
<!--@<template slot="header" slot-scope="props">@-->
<div :style="{ display: 'block'}" class="qy-logo" id="nav_logo" style="display: none;">
<i class="logo-dot"></i>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
</div>
<div class="qy-player-head-list" is="i71-playpage-source-video-floater" style="display:none;"></div>
</div>
'''

# find_all() returns every matching tag, including nested ones
print(soup.find_all('div'))
'''
result:
[<div class="qy-header" id="block-A" is="i71-header" page-name="" v-bind:non-index="true">
<!--@<template slot="header" slot-scope="props">@-->
<div :style="{ display: 'block'}" class="qy-logo" id="nav_logo" style="display: none;">
<i class="logo-dot"></i>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
</div>
<div class="qy-player-head-list" is="i71-playpage-source-video-floater" style="display:none;"></div>
</div>, <div :style="{ display: 'block'}" class="qy-logo" id="nav_logo" style="display: none;">
<i class="logo-dot"></i>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
</div>, <div class="qy-player-head-list" is="i71-playpage-source-video-floater" style="display:none;"></div>]
'''
A tag's .contents attribute returns the tag's direct children as a list.
tag = soup.head
print(tag)
print()
print(tag.contents)
'''
result:
<head itemprop="video" itemscope="" itemtype="//schema.org/VideoObject">
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>王牌对王牌4之姚晨沙溢再聚同福客栈</title>
</head>
['\n', <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>, '\n', <title>王牌对王牌4之姚晨沙溢再聚同福客栈</title>, '\n']
'''
A tag's .children attribute is an iterator over the tag's direct children.
for t in tag.children:
    print(t)
'''
result:
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>王牌对王牌4之姚晨沙溢再聚同福客栈</title>
'''
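.children and .contents only cover a tag's direct children. .descendants iterates recursively over all of the nodes below a tag: its children, its children's children, and so on.
for t in tag.descendants:
    print(t)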
'''
result:
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>王牌对王牌4之姚晨沙溢再聚同福客栈</title>
王牌对王牌4之姚晨沙溢再聚同福客栈
'''
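The difference between direct children and descendants is easy to see by counting them. This is a small sketch on an invented one-line document rather than the page used in this article; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical document: <head> has one child tag, which has one string.
soup = BeautifulSoup("<head><title>Demo</title></head>", "html.parser")
head = soup.head

children = list(head.children)        # direct children only: the <title> tag
descendants = list(head.descendants)  # the <title> tag and the string inside it

print(len(children))
print(len(descendants))
```

The string "Demo" is a child of the title tag, not of head, so it appears only in the descendants list.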
If a tag has exactly one child and that child is a NavigableString, .string returns it. If the tag has more than one child, .string is None.
tag = soup.head
print(tag.string)
title_tag = tag.title
print(title_tag.string)
'''
result:
None
王牌对王牌4之姚晨沙溢再聚同福客栈
'''
If a tag contains more than one string, use .strings to iterate over them.
for s in soup.strings:
    print(repr(s))
'''
result:
'\n'
'\n'
'\n'
'王牌对王牌4之姚晨沙溢再聚同福客栈'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'\n'
'综艺'
'\n'
'\n'
'\n'
'\n'
'\n'
'''
Use .stripped_strings to skip whitespace-only strings and strip leading and trailing whitespace from the rest.
for s in soup.stripped_strings:
    print(repr(s))
'''
'王牌对王牌4之姚晨沙溢再聚同福客栈'
'综艺'
'''
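A common use of .stripped_strings is collecting the visible text of a fragment as a clean list. The sketch below compares it with .strings on a small made-up fragment; the markup and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with whitespace between and inside the tags.
html = "<div>\n <h2> Variety </h2>\n <p>Show</p>\n</div>"
soup = BeautifulSoup(html, "html.parser")

raw = list(soup.strings)             # every string, whitespace included
clean = list(soup.stripped_strings)  # whitespace-only strings dropped, rest stripped

print(raw)
print(clean)
print(" / ".join(clean))
```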
Use the .parent attribute to get an element's parent node.
tag = soup.title
print(tag.parent)
'''
<head itemprop="video" itemscope="" itemtype="//schema.org/VideoObject">
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>王牌对王牌4之姚晨沙溢再聚同福客栈</title>
</head>
'''
Use the .parents attribute to iterate over all of an element's ancestors, from its parent up to the document itself.
tag = soup.title
for p in tag.parents:
    if p is None:
        print(p)
    else:
        print(p.name)
'''
head
html
[document]
'''
Use the .next_sibling and .previous_sibling attributes to move between nodes on the same level of the tree.
# .previous_sibling
tag = soup.a
previous_tag = tag.previous_sibling
print(previous_tag)
print(previous_tag.previous_sibling)
'''
result:
(a blank line is printed here: the whitespace between two tags is a node too)
<i class="logo-dot"></i>
'''

# .next_sibling
tag = soup.i
next_tag = tag.next_sibling
print(next_tag)
print(next_tag.next_sibling)
'''
result:
(a blank line is printed here)
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
'''
Use the .next_siblings and .previous_siblings attributes to iterate over all of a node's siblings.
# .previous_siblings
tag = soup.a
for previous in tag.previous_siblings:
    print(repr(previous))
'''
result:
'\n'
<i class="logo-dot"></i>
'\n'
'''

# .next_siblings
tag = soup.i
for following in tag.next_siblings:
    print(repr(following))
'''
result:
'\n'
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
'\n'
'''
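.next_element and .previous_element point to the object that was parsed immediately after or before the current one.
tag = soup.a

# .previous_element
print(tag.previous_element)
print(tag.previous_element.previous_element)
'''
result:
(a blank line is printed here: the object before this tag is the string '\n')
<i class="logo-dot"></i>
'''

# .next_element
print(tag.next_element)
'''
result:
<h2>综艺</h2>
'''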
.next_elements and .previous_elements iterate over the objects parsed after or before the current one.
# .previous_elements
tag = soup.head
for e in tag.previous_elements:
    print(e)
'''
result:
<html lang="zh-CN">
<head itemprop="video" itemscope="" itemtype="//schema.org/VideoObject">
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>王牌对王牌4之姚晨沙溢再聚同福客栈</title>
</head>
<body>
<div class="qy-header" id="block-A" is="i71-header" page-name="" v-bind:non-index="true">
<!--@<template slot="header" slot-scope="props">@-->
<div :style="{ display: 'block'}" class="qy-logo" id="nav_logo" style="display: none;">
<i class="logo-dot"></i>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
</div>
<div class="qy-player-head-list" is="i71-playpage-source-video-floater" style="display:none;"></div>
</div>
</body></html>
HTML
'''

# .next_elements
tag = soup.h2
for e in tag.next_elements:
    print(e)
'''
result:
综艺
<div class="qy-player-head-list" is="i71-playpage-source-video-floater" style="display:none;"></div>
'''
The search examples that follow use the same soup object as the navigation examples above.
find_all(name, attrs, recursive, string, limit, **kwargs)
# name argument
# Find every tag whose name is "a"
print(soup.find_all("a"))
'''
result:
[<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>]
'''

# keyword arguments
# A keyword argument is treated as a filter on the attribute of that name
import re
print(soup.find_all(id='nav_logo'))
print(soup.find_all(href=re.compile("zongyi/")))
# Some attribute names cannot be used as keyword arguments; put them in the attrs dictionary instead
# print(soup.find_all(class="qy-logo")) raises SyntaxError: invalid syntax
print(soup.find_all(attrs={"class": "qy-logo"}))
'''
result:
[<div :style="{ display: 'block'}" class="qy-logo" id="nav_logo" style="display: none;">
<i class="logo-dot"></i>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
</div>]
[<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>]
[<div :style="{ display: 'block'}" class="qy-logo" id="nav_logo" style="display: none;">
<i class="logo-dot"></i>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
</div>]
'''

# CSS class
# class is a reserved word in Python, so it cannot be used as a keyword argument.
# In Beautiful Soup versions after 4.1.1 you can search by CSS class with class_ instead
print(soup.find_all('i', class_='logo-dot'))
'''
result:
[<i class="logo-dot"></i>]
'''

# string argument (called text in older versions)
# Searches the strings in the document instead of the tags; it also accepts a regular expression, a list, and so on
print(soup.find_all(string="综艺"))
'''
result:
['综艺']
'''

# limit argument
# Limits the number of results returned
print(soup.find_all("div", limit=1))
'''
result:
[<div class="qy-header" id="block-A" is="i71-header" page-name="" v-bind:non-index="true">
<!--@<template slot="header" slot-scope="props">@-->
<div :style="{ display: 'block'}" class="qy-logo" id="nav_logo" style="display: none;">
<i class="logo-dot"></i>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
</div>
<div class="qy-player-head-list" is="i71-playpage-source-video-floater" style="display:none;"></div>
</div>]
'''

# recursive argument
# find_all() searches all descendants of the current tag by default.
# Pass recursive=False to search only its direct children
print(soup.find_all("div", id='nav_logo', recursive=True))
print(soup.find_all("div", id='nav_logo', recursive=False))
'''
result:
[<div :style="{ display: 'block'}" class="qy-logo" id="nav_logo" style="display: none;">
<i class="logo-dot"></i>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
</div>]
[]
'''
If you only want a single result, use the find() method.
print(soup.find("title"))
'''
result:
<title>王牌对王牌4之姚晨沙溢再聚同福客栈</title>
'''
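# soup.find("title") returns the same tag as soup.find_all("title", limit=1)[0].
# When nothing matches, find() returns None while find_all() returns an empty list.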
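The difference in return types matters in practice: the result of find() has to be checked for None before use, while the result of find_all() can always be looped over. A minimal sketch on an invented one-tag document, with illustrative variable names:

```python
from bs4 import BeautifulSoup

# Hypothetical document that contains a <title> but no <table>.
soup = BeautifulSoup("<title>Demo</title>", "html.parser")

first = soup.find("title")                 # a single Tag
as_list = soup.find_all("title", limit=1)  # a list holding that same Tag

missing_one = soup.find("table")           # nothing found: None
missing_all = soup.find_all("table")       # nothing found: empty list

print(first is as_list[0])
print(missing_one, len(missing_all))
```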
Pass a string to find_all() to match tags with exactly that name.
print(soup.find_all('a'))
'''
result:
[<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>]
'''
Pass a regular expression to find_all() to match tags whose names match the pattern.
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
'''
result:
body
'''
Pass a list to find_all() to match tags whose names equal any item in the list.
print(soup.find_all(["i","a"]))
'''
result:
[<i class="logo-dot"></i>, <a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>]
'''
The value True matches every tag, but no strings.
for tag in soup.find_all(True):
    print(tag.name)
'''
result:
html
head
meta
title
body
div
div
i
a
h2
div
'''
Pass a function to find_all() to match every tag for which the function returns True.
def method1(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
print(soup.find_all(method1))
'''
result:
[<i class="logo-dot"></i>, <a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>, <div class="qy-player-head-list" is="i71-playpage-source-video-floater" style="display:none;"></div>]
'''
find_parents() and find_parent() search the ancestors of the current node.
a_string = soup.find(string="综艺")
print(a_string)
print(a_string.find_parents("a"))
print(a_string.find_parent("a"))
'''
result:
综艺
[<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>]
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
'''
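find_next_siblings() and find_next_sibling() search the siblings that follow the current node: find_next_siblings() returns every sibling that matches, while find_next_sibling() returns only the first match.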
print(soup.i.find_next_siblings("a"))
print(soup.i.find_next_sibling("a"))
'''
result:
[<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>]
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
'''
find_all_next() and find_next() search the nodes that come after the current node in the document.
print(soup.i.find_all_next())
print(soup.i.find_next())
'''
result:
[<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>, <h2>综艺</h2>, <div class="qy-player-head-list" is="i71-playpage-source-video-floater" style="display:none;"></div>]
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
'''
find_all_previous() and find_previous() search the nodes that come before the current node in the document.
print(soup.title.find_all_previous())
print(soup.title.find_previous())
'''
result:
[<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>, <head itemprop="video" itemscope="" itemtype="//schema.org/VideoObject">
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>王牌对王牌4之姚晨沙溢再聚同福客栈</title>
</head>, <html lang="zh-CN">
<head itemprop="video" itemscope="" itemtype="//schema.org/VideoObject">
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>王牌对王牌4之姚晨沙溢再聚同福客栈</title>
</head>
<body>
<div class="qy-header" id="block-A" is="i71-header" page-name="" v-bind:non-index="true">
<!--@<template slot="header" slot-scope="props">@-->
<div :style="{ display: 'block'}" class="qy-logo" id="nav_logo" style="display: none;">
<i class="logo-dot"></i>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
</div>
<div class="qy-player-head-list" is="i71-playpage-source-video-floater" style="display:none;"></div>
</div>
</body></html>]
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
'''
Pass a CSS selector string to the .select() method to search with CSS selectors.
# Select by tag name
print(soup.select('a'))
'''
result:
[<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>]
'''

# Select by id
print(soup.select('#nav_logo'))
'''
result:
[<div :style="{ display: 'block'}" class="qy-logo" id="nav_logo" style="display: none;">
<i class="logo-dot"></i>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
</div>]
'''

# Select by class
print(soup.select('.qy-logo'))
'''
result:
[<div :style="{ display: 'block'}" class="qy-logo" id="nav_logo" style="display: none;">
<i class="logo-dot"></i>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
</div>]
'''

# Select by attribute value
print(soup.select('div[style="display:none;"]'))
'''
result:
[<div class="qy-player-head-list" is="i71-playpage-source-video-floater" style="display:none;"></div>]
'''
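Selectors can also be combined. The sketch below, written against a simplified and invented version of the page above, shows a child combinator, a descendant selector with select_one() (which returns the first match or None instead of a list), and an attribute prefix match. It assumes a reasonably recent Beautiful Soup, where selector support is provided by the soupsieve package that is installed alongside it.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modelled on the navigation block above.
html = (
    '<div id="nav"><i class="dot"></i>'
    '<a class="link" href="/zongyi/"><h2>Variety</h2></a></div>'
)
soup = BeautifulSoup(html, "html.parser")

# Child combinator: <a class="link"> directly inside the element with id "nav".
links = soup.select("div#nav > a.link")

# Descendant selector; select_one() returns a single Tag rather than a list.
heading = soup.select_one("a h2")

# Attribute selector: href values that start with "/zong".
prefixed = soup.select('a[href^="/zong"]')

print(links[0]["href"])
print(heading.string)
print(len(prefixed))
```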
The modification examples that follow start from the same document as the earlier examples. Every operation changes the tree in place, so each snippet assumes a freshly parsed soup.
# Rename a tag, then change and delete one of its attributes
tag = soup.i
print(tag)
tag.name = "a"
print(tag)
tag['class'] = 'logo'
print(tag)
del tag['class']
print(tag)
'''
result:
<i class="logo-dot"></i>
<a class="logo-dot"></a>
<a class="logo"></a>
<a></a>
'''
tag = soup.h2
print(tag)
tag.string = "zongyi"
print(tag)
'''
result:
<h2>综艺</h2>
<h2>zongyi</h2>
'''
append() adds content to the end of a tag's contents.
tag = soup.h2
print(tag)
tag.append(" hhhh ")
print(tag)
'''
result:
<h2>综艺</h2>
<h2>综艺 hhhh </h2>
'''
# new_string() is a method of the BeautifulSoup object, not of a tag
s1 = BeautifulSoup("<b></b>")
tag = s1.b
print(tag)
tag.append(s1.new_string(" test "))
print(tag)
'''
result:
<b></b>
<b> test </b>
'''

# Add a comment
s1 = BeautifulSoup("<b></b>")
tag = s1.b
print(tag)
from bs4 import Comment
comment = s1.new_string("1 2 3", Comment)
tag.append(comment)
print(tag)
'''
result:
<b></b>
<b><!--1 2 3--></b>
'''

# Add a new tag
s1 = BeautifulSoup("<b></b>")
tag = s1.b
print(tag)
new_tag = s1.new_tag("a", href="http://www.baidu.com")
tag.append(new_tag)
print(tag)
'''
result:
<b></b>
<b><a href="http://www.baidu.com"></a></b>
'''
# insert() places content at the given position in a tag's contents
tag = soup.a
tag.insert(0, " hello ")
print(tag)
tag.insert(2, " world ")
print(tag)
'''
result:
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"> hello <h2>综艺</h2></a>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"> hello <h2>综艺</h2> world </a>
'''

# insert_before() places a node immediately before the current one
tag = soup.a
tag1 = soup.i
tag1.string = "hello"
tag.string.insert_before(tag1)
print(tag)
'''
result:
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2><i class="logo-dot">hello</i>综艺</h2></a>
'''

# insert_after() places a node immediately after the current one
tag = soup.a
tag1 = soup.i
tag1.string = "hello"
tag.string.insert_after(tag1)
print(tag)
'''
result:
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺<i class="logo-dot">hello</i></h2></a>
'''
clear() removes the contents of the current node.
tag = soup.a
print(tag)
tag.clear()
print(tag)
'''
result:
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"></a>
'''
extract() removes the current node from the tree and returns it.
tag = soup.a
print(tag)
h_tag = tag.h2.extract()
print(tag)
print(h_tag)
'''
result:
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"></a>
<h2>综艺</h2>
'''
decompose() removes the current node from the tree and destroys it completely.
tag = soup.a
print(tag)
tag.h2.decompose()
print(tag)
'''
result:
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"></a>
'''
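The practical difference between the two removal methods is what you get back. The sketch below, on an invented fragment, keeps the node returned by extract() as a standalone tree and shows that decompose() returns nothing to keep.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two children to remove in different ways.
soup = BeautifulSoup("<a><h2>one</h2><b>two</b></a>", "html.parser")

# extract() detaches the node and hands it back; it can be reused elsewhere.
removed = soup.h2.extract()

# decompose() detaches the node and destroys it; the call returns None.
result = soup.b.decompose()

print(removed, removed.parent)
print(result)
print(soup.a)
```

Prefer extract() when the removed node is needed later, and decompose() when it is simply unwanted.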
replace_with() replaces part of the tree with a new tag or string.
tag = soup.a
print(tag)
new_tag = soup.new_tag("b")
new_tag.string = "test"
tag.h2.replace_with(new_tag)
print(tag)
'''
result:
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><h2>综艺</h2></a>
<a class="logo-channel" href="//www.iqiyi.com/zongyi/" rseat="708151_channel_6" title="综艺"><b>test</b></a>
'''
wrap() wraps an element in the tag you give it, and unwrap() does the opposite, replacing a tag with its contents.
# wrap()
doc = BeautifulSoup("<p>I wish I was bold.</p>", "lxml")
print(doc)
doc.p.string.wrap(doc.new_tag("b"))
print(doc)
'''
result:
<html><body><p>I wish I was bold.</p></body></html>
<html><body><p><b>I wish I was bold.</b></p></body></html>
'''

# unwrap()
doc = BeautifulSoup("<p>I wish I was bold.</p>", "lxml")
print(doc)
doc.p.string.wrap(doc.new_tag("b"))
print(doc)
doc.b.unwrap()
print(doc)
'''
result:
<html><body><p>I wish I was bold.</p></body></html>
<html><body><p><b>I wish I was bold.</b></p></body></html>
<html><body><p>I wish I was bold.</p></body></html>
'''
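These are my notes from learning Beautiful Soup 4 by working through its documentation. Some parts may not be detailed enough, but I hope they still help other beginners. For more, see the Beautiful Soup Documentation.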