
The lxml Module in Python

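lxml: a library for parsing XML files. HTML is not strictly a subset of XML, but the two are closely related and lxml also ships a lenient HTML parser, so HTML documents can be analyzed with lxml as well as with regular expressions.
Sample document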

<bookstore>
        <li id='test3'> li test3</li>
        <book>
          <title>Harry Potter</title>
          <author>J K. Rowling</author>
          <year>2005</year>
          <price>29.99</price>
          <li>li test1</li>
          <li id='test2'>li test2</li>
        </book>
</bookstore>
<test>
    <li id='test3'>li test4</li>
</test>

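lxml examples

Example 1: find the title "Harry Potter"
/bookstore/book/title
Example: find all li elements directly inside book
/bookstore/book/li
Example: find all li elements inside bookstore
/bookstore/book/li|/bookstore/li (| is the union operator: nodes matching either path are returned)
/bookstore//li (// matches li at any depth below bookstore)
Example: find every li in the whole document
//li
Example: find all li elements that have an id attribute
//li[@id]
Example: find all li elements whose id attribute equals test3
//li[@id='test3']
Example: get the id attribute of every li
//li/@id returns the attribute values
//li/text() returns the text content of the elements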
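
The expressions above can be tried out directly. The following is a minimal sketch, assuming only the `<bookstore>` part of the sample document is parsed with the strict XML parser (`etree.fromstring`), since well-formed XML allows a single root element and the absolute paths then start at `/bookstore`. The variable names are illustrative.

```python
from lxml import etree

# Assumption: only the <bookstore> subtree of the sample document is used,
# because an XML document may have just one root element.
xml = """<bookstore>
  <li id='test3'> li test3</li>
  <book>
    <title>Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
    <li>li test1</li>
    <li id='test2'>li test2</li>
  </book>
</bookstore>"""

root = etree.fromstring(xml)

title = root.xpath('/bookstore/book/title/text()')         # text of the title
book_li = root.xpath('/bookstore/book/li')                 # li directly inside book
union_li = root.xpath('/bookstore/book/li|/bookstore/li')  # union of both paths
any_depth_li = root.xpath('/bookstore//li')                # li at any depth
with_id = root.xpath('//li[@id]')                          # li that carry an id
test3 = root.xpath("//li[@id='test3']")                    # li whose id is test3
ids = root.xpath('//li/@id')                               # attribute values
texts = root.xpath('//li/text()')                          # text content

print(title)   # ['Harry Potter']
print(ids)     # ['test3', 'test2']
print(texts)   # [' li test3', 'li test1', 'li test2']
```

Element results (such as `book_li`) are lists of element objects, while paths ending in `/@id` or `/text()` return lists of strings.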

A complete example:

from lxml import etree
html = '''    <bookstore>
            <li id='test3'> li test3</li>
            <book>
              <title>Harry Potter</title>
              <author>J K. Rowling</author>
              <year>2005</year>
              <price>29.99</price>
              <li>li test1</li>
              <li id='test2'>li test2</li>
            </book>
    </bookstore>
    <test>
        <li id='test3'>li test4</li>
    </test>'''

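# etree.HTML uses the lenient HTML parser and returns the root element
dom = etree.HTML(html)
# Text content of every li
ret = dom.xpath('//li/text()')
print(ret)
# id attribute values of every li that has one
ret = dom.xpath('//li/@id')
print(ret)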
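
One detail worth keeping in mind with this script: `etree.HTML` uses lxml's HTML parser, which wraps the fragment in `<html>` and `<body>` elements. Absolute paths such as `/bookstore/book/title` therefore match nothing here, while paths starting with `//` still work. A small sketch illustrating this, using a shortened copy of the sample markup:

```python
from lxml import etree

# Shortened copy of the sample markup (an assumption made for brevity)
html = '''<bookstore>
  <book>
    <title>Harry Potter</title>
    <li id='test2'>li test2</li>
  </book>
</bookstore>'''

dom = etree.HTML(html)
print(dom.tag)  # the root is the implied <html> element, not <bookstore>

# Absolute path from the document root: bookstore is not the root, so no match
absolute = dom.xpath('/bookstore/book/title/text()')
# Search at any depth, or spell out the implied html/body wrappers
anywhere = dom.xpath('//bookstore/book/title/text()')
full = dom.xpath('/html/body/bookstore/book/title/text()')

print(absolute, anywhere, full)
```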

Another complete example:

from lxml import etree
html = '''    <bookstore>
            <li id='test3'> li test3</li>
            <book>
              <title>Harry Potter</title>
              <author>J K. Rowling</author>
              <year>2005</year>
              <price>29.99</price>
              <li>li test1</li>
              <li id='test2'>li test2</li>
            </book>
    </bookstore>
    <test>
        <li id='test3'>li test4</li>
    </test>'''

dom = etree.HTML(html)
# Select only the li elements that have an id attribute
ret = dom.xpath('//li[@id]')
for li in ret:
    print(li.text)                       # text content of the element
    print(li.attrib['id'])               # value of the id attribute
    print(etree.tostring(li).decode())   # the element serialized back to markup
    print('=' * 50)



# Scrape the "非人哉" comic from baozoumanhua.com
# author : shuaijie_liu
# date 2019-05-01
# email : 15028349493@163.com
import requests
from lxml import etree

USER_AGENT = (r'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
              r'(KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36')

def down_html(url, timeout=10, headers=None, verify=True):
    # Fall back to a browser-like User-Agent when no headers are given
    if not headers:
        headers = {'User-Agent': USER_AGENT}
    # The request runs whether or not the caller supplied headers
    req = requests.get(url=url, headers=headers, verify=verify, timeout=timeout)
    return req.text

def find_imgs(data, exp):
    # Parse the HTML and return every node matched by the XPath expression
    dom = etree.HTML(data)
    return dom.xpath(exp)

def download_img(url, filename, timeout=10, headers=None, verify=True):
    if not headers:
        headers = {'User-Agent': USER_AGENT}
    req = requests.get(url=url, headers=headers, verify=verify, timeout=timeout)
    with open(filename, 'wb') as f:
        f.write(req.content)

if __name__ == '__main__':
    for page in range(27):
        url = r'http://baozoumanhua.com/channels/1562?page={}'.format(page)
        imgs = r'//div[@class="article-body"]//img/@src'
        try:
            html = down_html(url=url)
        except Exception as e:
            print('Html Error {} : {}'.format(url, e))
            continue
        img_urls = find_imgs(html, imgs)
        if not img_urls:
            # Nothing matched on this page; img_urls[0] would raise IndexError
            continue
        # Drop consecutive duplicate URLs
        ret = [img_urls[0]]
        for img_url in img_urls:
            if img_url != ret[-1]:
                ret.append(img_url)
        for filename, img_url in enumerate(ret, start=1):
            file = "{}-{}.jpg".format(page + 1, filename)
            print('down load {}'.format(img_url))
            try:
                download_img(img_url, file)
            except Exception as e:
                print('IMAGE ERROR {}:{}'.format(img_url, e))
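
The extraction and de-duplication steps of the script can be checked without any network access. The sketch below is an illustration under assumptions: the HTML snippet is made up for the example, `dedupe_consecutive` is a hypothetical helper name, and `itertools.groupby` stands in for the manual loop that drops consecutive repeated URLs.

```python
from itertools import groupby
from lxml import etree

def find_imgs(data, exp):
    # Parse the HTML and return every node matched by the XPath expression
    dom = etree.HTML(data)
    return dom.xpath(exp)

def dedupe_consecutive(urls):
    # groupby groups runs of equal neighbours; keep one key per run
    return [url for url, _ in groupby(urls)]

# Made-up page fragment (an assumption, not the real site's markup)
sample_html = '''
<div class="article-body">
  <img src="a.jpg"><img src="a.jpg">
  <p><img src="b.jpg"></p>
</div>
<div class="sidebar"><img src="ad.jpg"></div>
'''

img_urls = find_imgs(sample_html, '//div[@class="article-body"]//img/@src')
unique_urls = dedupe_consecutive(img_urls)
print(unique_urls)
```

Only images inside the `article-body` div are matched, and an empty result list passes through `dedupe_consecutive` without raising.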