
lxml Crawler in Action: Scraping Life Quotes


1. Scrape the content of a single quote URL

    from lxml import etree
    import requests

    header = {'User-Agent':
              'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 Edg/126.0.0.0'}

    # 1. Fetch the title and body text of a single quote page
    def get_content(url, header):
        r = requests.get(url, headers=header)
        r.encoding = r.apparent_encoding  # avoid garbled Chinese text
        html = etree.HTML(r.text)
        title = html.xpath('//article/h1/text()')[0]
        result = html.xpath('//div[@id="print-area"]/p/text()')
        content = "\n".join(result[1:])  # skip the first <p>, which is not part of the quote text
        return title, content
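To sanity-check the function before crawling everything, you can call it on one quote page and print the result. The URL below is a hypothetical placeholder, not a verified link; substitute any quote link collected in step 2.

    # Quick sanity check (test_url is a hypothetical placeholder)
    test_url = 'https://www.fenzhiwu.com/lizhigeyan/rensheng/12345.html'
    title, content = get_content(test_url, header)
    print(title)
    print(content[:100])  # preview the first 100 characters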

2. Collect every quote URL on the listing page

    # 2. Collect all quote links on the listing page
    base_url = "https://www.fenzhiwu.com/lizhigeyan/rensheng/"
    response = requests.get(base_url, headers=header)  # headers must be passed as a keyword argument
    response.encoding = response.apparent_encoding     # fix garbled Chinese text
    base_html = etree.HTML(response.text)
    urls = base_html.xpath('//div[@class="uk-width-medium-4-5"]/h2/a/@href')
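The hrefs collected here are protocol-relative, which is why step 3 prepends 'https:' to each one. A slightly more robust alternative (my suggestion, not part of the original script) is urllib.parse.urljoin, which handles protocol-relative and plain relative paths alike:

    from urllib.parse import urljoin

    # Normalize every collected href against the base URL
    full_urls = [urljoin(base_url, u) for u in urls]
    print(len(full_urls), full_urls[:3])  # quick look at what was collected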

3. Download and save the quote content

    # 3. Download and save each quote
    import os

    os.makedirs('格言', exist_ok=True)  # create the output directory if it does not exist
    for url in urls:
        url = 'https:' + url  # the collected hrefs are protocol-relative
        title, content = get_content(url, header)
        with open(f'格言/{title}.txt', 'w', encoding='utf-8') as f:
            f.write(title + '\n\n')
            f.write(content + '\n\n')
        print(f'Downloaded: {title}')
    print('All downloads complete!')
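Two practical hardening steps are worth considering; both are assumptions on my part rather than part of the original script. Quote titles may contain characters that are illegal in Windows filenames, and firing requests back-to-back can get a crawler blocked. A minimal sketch:

    import re
    import time

    def safe_filename(title):
        # Replace characters that are illegal in Windows filenames
        return re.sub(r'[\\/:*?"<>|]', '_', title)

    # In the loop above: open(f'格言/{safe_filename(title)}.txt', ...) instead,
    # and call time.sleep(1) after each download to go easy on the server.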
