
Reading HTML content and the DOM in Python: extracting specific HTML source, rich-text editors, building a DOM with a crawler

html = browser.page_source, doc = pyquery(html), then get the href

Getting specific HTML source with Python BeautifulSoup - 吴悟无 - Cnblogs https://www.cnblogs.com/vickey-wu/p/6843411.html

Using the PyQuery library - CSDN Blog https://blog.csdn.net/qw_xingzhe/article/details/75175256

Python crawlers: an introduction to the PyQuery library and how to use it - Jianshu https://www.jianshu.com/p/c07f7cd1b548

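pyquery is essentially a Python implementation of jQuery and can be used to parse HTML pages, among other things. Its syntax is almost identical to jQuery's, so it feels familiar to anyone who has used jQuery and is easy to pick up.

To quote the author:

“The API is as much as possible the similar to jquery.”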

from selenium import webdriver
import time
import random
from bs4 import *
from pyquery import PyQuery as pq

# Launch Chrome and open the poem page to scrape.
browser = webdriver.Chrome()
url = 'https://so.gushiwen.org/shiwenv_ee16df5673bc.aspx'
browser.get(url)

js = "a_=document.getElementsByTagName('a');le=a_.length;for(i=0;i

try:
    browser.execute_script(js)
except Exception as e:
    print(e)

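# Count the '展开阅读全文 ∨' ("expand to read the full text") links on the page.
# find_elements_by_link_text is the Selenium 3 API; Selenium 4 replaces it
# with browser.find_elements(By.LINK_TEXT, ...).
ck_l_ori_len = len(browser.find_elements_by_link_text('展开阅读全文 ∨'))
ck_l_ori_ok = 0  # number of links clicked so far
try:
    for isc in range(100):
        # Stop once every expand link has been clicked.
        if ck_l_ori_ok == ck_l_ori_len:
            break
        time.sleep(1)
        # Scroll down in 100-pixel steps so each link comes into view.
        # (Alternative: 'window.scrollTo(0,document.body.scrollHeight)' jumps to the bottom.)
        js = 'window.scrollTo(0,100*{})'.format(isc)
        browser.execute_script(js)
        ck_l = browser.find_elements_by_link_text('展开阅读全文 ∨')
        for i in ck_l:
            try:
                i.click()
                ck_l_ori_ok += 1
            except Exception as e:
                # The link may not be clickable yet; it is retried on the next pass.
                print(e)
except Exception as e:
    print('window.scrollTo-->', e)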

# Parse the fully expanded page with pyquery.
doc = pq(browser.page_source)

# Remove the XHTML namespace attribute from the extracted fragments.
pq_r_d = {'xmlns="http://www.w3.org/1999/xhtml"': ''}
r_k, r_v = 'xmlns="http://www.w3.org/1999/xhtml"', ''

# Poem body, title and author: all inside the second child of .left.
article_ = doc('.left>:nth-child(2).sons>.cont>.contson').html().replace(r_k, r_v)
title_d = {'h1': doc('.left>:nth-child(2).sons>.cont>:nth-child(2)').html().replace(r_k, r_v)}
author_d = {'h3': doc('.left>:nth-child(2).sons>.cont>:nth-child(3)').text()}

# Translation, explanation and references: inside the fourth child of .left.
translation_ = doc('.left>:nth-child(4)>.contyishang>:nth-child(2)').html().replace(r_k, r_v)
explanation_ = doc('.left>:nth-child(4)>.contyishang>:nth-child(3)').html().replace(r_k, r_v)
refer_ = doc('.left>:nth-child(4)>.cankao').html().replace(r_k, r_v)

# Author portrait URL: the value of the last src="..." in the first child of .divimg.
author_img_url = doc('.left>.sonspic>.cont>.divimg>:nth-child(1)').html().split('src="')[-1].split('"')[0]

# Wrap the title and the author in their tags, e.g. <h1>...</h1>.
k = 'h1'
v = title_d[k]
db_html = '<{}>{}</{}>'.format(k, v, k)
k = 'h3'
v = author_d[k]
db_html = '{}<{}>{}</{}>'.format(db_html, k, v, k)

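# Append the author's portrait. The markup inside this format string did not
# survive in the source; '<br><img src="{}">' is an assumed reconstruction.
db_html = '{}{}'.format(db_html, '<br><img src="{}">'.format(author_img_url))
l = [db_html, article_, explanation_, translation_, refer_]
# Join the sections; the separator is likewise assumed to be '<br>'.
db_html = '<br>'.join(l)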

rp_s_l = ['', 'X'), db_html[p1 + 1:])

p2 = tmp.index('>')

db_html = '{}{}{}'.format(db_html[0:p1], '', db_html[p2 + 1:])
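The last lines of the script appear to remove a tag from db_html by locating where it starts (p1) and the next '>' after it (p2), then joining the text on either side. A minimal sketch of that idea, using only string methods, might look like the following. The function name cut_tag, its parameters, and the sample string are hypothetical and not part of the original script.

```python
def cut_tag(html, marker):
    """Remove the first tag that starts with `marker` (e.g. '<a '), up to its closing '>'.

    Returns the string unchanged when the marker or a closing '>' is not found.
    """
    p1 = html.find(marker)       # start of the tag
    if p1 == -1:
        return html
    p2 = html.find('>', p1)      # first '>' at or after the tag start
    if p2 == -1:
        return html
    # Keep everything before the tag and everything after its '>'.
    return html[:p1] + html[p2 + 1:]


# Hypothetical fragment: strip the link tags but keep the link text.
sample = '<p>text <a href="/x.html">link</a> more</p>'
without_open = cut_tag(sample, '<a ')
without_link_tags = cut_tag(without_open, '</a')
print(without_link_tags)
```

For anything beyond simple fragments like these, removing nodes through the parser (for example with pyquery's .remove()) is likely to be more robust than string slicing.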
