Python Web Scraping (1): The BeautifulSoup find_all() Function

Author: 天景科技苑 | 2024-08-20 05:31:10

I. Syntax

find_all(name, attrs, recursive, string, limit, **kwargs)
The find_all() method searches all descendants of the current tag and returns every one that matches the given filters.
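The examples in this article use a `soup` object that is never defined in the text. Judging by the printed results, it appears to be the "three sisters" sample from the Beautiful Soup documentation; the sketch below builds it under that assumption. The variable name `html_doc` and the choice of `html.parser` are illustrative, not taken from the article.

```python
import re  # needed by the later examples that call re.compile()
from bs4 import BeautifulSoup

# Assumed sample document: the "three sisters" snippet that the
# outputs shown in this article appear to come from.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body></html>
"""

# html.parser ships with Python, so no extra parser needs installing.
soup = BeautifulSoup(html_doc, "html.parser")

links = soup.find_all("a")
link_ids = [a["id"] for a in links]
print(len(links), link_ids)
```

find_all() returns a list-like ResultSet of Tag objects, so the usual list operations (indexing, len(), iteration) apply.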

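II. Parameters and usage

1. The name parameter

This is the simplest and most direct approach: look up tags by their HTML tag name.

sb = soup.find_all('img')

Note: the value of name can be any type of filter: a string, a regular expression, a list, a function, or True.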
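To make the filter types concrete, here is a minimal sketch that passes each kind of filter as the name argument. The small HTML fragment and the helper `has_href` are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

html = (
    "<html><head><title>T</title></head>"
    "<body><p><b>bold</b> <a href='#'>link</a></p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# String: exact tag name.
by_string = [t.name for t in soup.find_all("b")]

# Regular expression: any tag whose name starts with "b".
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]

# List: a tag matches if its name equals any item.
by_list = [t.name for t in soup.find_all(["a", "b"])]

# Function: called with each Tag, keeps those for which it returns True.
def has_href(tag):
    return tag.has_attr("href")

by_function = [t.name for t in soup.find_all(has_href)]

# True: matches every tag (but not text strings).
all_tags = [t.name for t in soup.find_all(True)]

print(by_string, by_regex, by_list, by_function, all_tags)
```

Results always come back in document order, regardless of the order of the items in a list filter.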

2. Keyword arguments

Keyword arguments filter on a tag's attributes, such as id, href (mainly found on a tags), and title. Passing class this way does not work, but that case is covered further down.

soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Passing True selects every tag that has an id attribute, whatever its value:

soup.find_all(id=True)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

You can also filter on several attributes at once:

soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

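Some attributes cannot be used as keyword argument names at all, for example the HTML5 data-* attributes:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression (the exact message depends on the Python version)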

However, you can search for tags with such attributes by passing a dictionary through the attrs parameter of find_all():

data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
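The attrs dictionary is useful beyond data-* attributes. As a rough sketch (the markup below is invented for illustration), it also handles an HTML attribute called name, which would otherwise collide with the first parameter of find_all(), and it can match class without the trailing underscore.

```python
from bs4 import BeautifulSoup

markup = (
    '<div data-foo="value">foo!</div>'
    '<input name="email"/>'
    '<p class="note">hi</p>'
)
soup = BeautifulSoup(markup, "html.parser")

# Attribute names that are not valid Python identifiers.
data_hits = soup.find_all(attrs={"data-foo": "value"})

# name=... as a keyword is read as the TAG name, so this looks for <email> tags.
by_tag_name = soup.find_all(name="email")

# Going through attrs matches the HTML attribute called "name" instead.
by_name_attr = soup.find_all(attrs={"name": "email"})

# class can be matched through attrs as well.
by_class = soup.find_all(attrs={"class": "note"})

print(data_hits, by_tag_name, by_name_attr, by_class)
```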

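class cannot be passed as a keyword argument the way id can, because class is a reserved word in Python. (A reserved word is one the language has already defined, so it cannot be used as a variable or function name.) Writing class=... directly raises a SyntaxError, which is why Beautiful Soup provides class_ instead.
[Figure: Python's reserved words]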
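Python's reserved words can also be listed directly from the standard library, which avoids relying on a static picture that may be out of date for your interpreter. A small sketch using the keyword module:

```python
import keyword

# The full list of reserved words for the running Python version.
print(keyword.kwlist)

# "class" is reserved, so it cannot be used as a keyword argument name.
class_is_reserved = keyword.iskeyword("class")

# "class_" and "id" are ordinary identifiers, so both work as keyword arguments.
class_underscore_is_reserved = keyword.iskeyword("class_")
id_is_reserved = keyword.iskeyword("id")

print(class_is_reserved, class_underscore_is_reserved, id_is_reserved)
```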

A tag name and an attribute filter can be used together:

soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In addition, class_ accepts the same kinds of filters: a string, a regular expression, a function, or True. The attribute filters shown earlier can likewise be combined with a tag name.

soup.find_all(class_=re.compile("itl"))
# [<p class="title"><b>The Dormouse's story</b></p>]

def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

soup.find_all(class_=has_six_characters)
# The function is called with each tag's class value; tags for which it returns True are kept.
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
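One detail worth knowing when filtering by class: a tag may carry several CSS classes, and class_ matches against each of them individually. The following sketch, using a made-up one-line document, shows the behaviour described in the Beautiful Soup documentation.

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout">text</p>', "html.parser")

# Either individual class matches.
by_one_class = css_soup.find_all("p", class_="strikeout")
by_other_class = css_soup.find_all("p", class_="body")

# The exact attribute string also matches...
by_exact_string = css_soup.find_all("p", class_="body strikeout")

# ...but the same classes in a different order do not.
by_swapped_string = css_soup.find_all("p", class_="strikeout body")

print(len(by_one_class), len(by_other_class),
      len(by_exact_string), len(by_swapped_string))
```

If the order of classes should not matter, a CSS selector such as css_soup.select("p.strikeout.body") is usually the more robust choice.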

3. The string parameter

The string parameter searches the text content of the document rather than tags. Like name, it accepts a string, a regular expression, a list, a function, or True.

soup.find_all("a", string="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
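The result type depends on whether a tag name is given alongside string. The sketch below, on an invented fragment, suggests the difference: with string alone you get the matching text nodes, while combining it with a tag name returns the tags whose text matches.

```python
import re
from bs4 import BeautifulSoup

html = (
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> '
    'and <a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

# string alone: returns NavigableString objects, not tags.
only_strings = [str(s) for s in soup.find_all(string="Elsie")]

# A list matches any of its items; results stay in document order.
from_list = [str(s) for s in soup.find_all(string=["Tillie", "Elsie", "Lacie"])]

# A regular expression is applied with search() semantics.
by_regex = [str(s) for s in soup.find_all(string=re.compile("ie$"))]

# Tag name + string: returns the <a> tags whose .string matches.
matching_tags = soup.find_all("a", string="Elsie")

print(only_strings, from_list, by_regex, matching_tags)
```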

4. The limit parameter

This parameter caps the number of results, much like LIMIT in SQL: the search stops as soon as limit matches have been found.

soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
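A common special case is limit=1, which is essentially what find() does. The sketch below (the HTML fragment is invented) compares the two, including how they differ when nothing matches.

```python
from bs4 import BeautifulSoup

html = (
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> '
    'and <a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

first_two = soup.find_all("a", limit=2)

# limit=1 still returns a list (of length one)...
first_as_list = soup.find_all("a", limit=1)
# ...while find() returns the Tag itself.
first_tag = soup.find("a")

# With no match, find_all() gives an empty list and find() gives None.
no_match_list = soup.find_all("img", limit=1)
no_match_tag = soup.find("img")

print(first_two, first_as_list, first_tag, no_match_list, no_match_tag)
```

Because find() can return None, code that chains attribute access on its result should check for None first.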

5. The recursive parameter

When you call find_all() on a tag, Beautiful Soup examines all of that tag's descendants. To search only the tag's direct children, pass recursive=False.

HTML:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
...

Python:

soup.html.find_all("title")
# [<title>The Dormouse's story</title>]

soup.html.find_all("title", recursive=False)
# []

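The second call returns nothing because <title> is not a direct child of <html>: it sits inside <head>. With recursive=False only the direct children of <html> are examined, so no match is found.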
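The effect of recursive=False is easiest to see by listing what counts as a direct child. A minimal self-contained sketch, using a shortened version of the document above (the `<body>` content is made up):

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p>text</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Default: searches every descendant of <html>, so <title> is found.
deep = soup.html.find_all("title")

# recursive=False: only direct children of <html> are examined.
shallow = soup.html.find_all("title", recursive=False)

# The direct children of <html> are just <head> and <body>.
direct_children = [t.name for t in soup.html.find_all(True, recursive=False)]

# <title> IS a direct child of <head>, so this search succeeds.
from_head = soup.head.find_all("title", recursive=False)

print(len(deep), len(shallow), direct_children, len(from_head))
```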

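Beautiful Soup offers many other tree-searching methods, and most of them take the same arguments as find_all(): name, attrs, string (called text in older versions), and limit. The recursive parameter is the exception: only find_all() and find() support it.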
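As an illustration of the shared arguments, the sketch below calls two of those other search methods, find_next_siblings() and find_parents(), with the same kinds of filters used throughout this article. The HTML fragment is invented for the example.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="story">'
    '<a id="link1" class="sister">Elsie</a>, '
    '<a id="link2" class="sister">Lacie</a> and '
    '<a id="link3" class="sister">Tillie</a>'
    "</p>"
)
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a", id="link1")

# Same name / class_ / limit arguments, but searching later siblings.
next_one = first.find_next_siblings("a", class_="sister", limit=1)

# Same arguments again, this time searching upward through ancestors.
story_parents = first.find_parents("p", class_="story")

print(next_one, story_parents)
```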
