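Install the Web Scraper extension for Chrome: https://www.webscraper.io/
As an example, take this CSDN blog listing page: https://blogdev.blog.csdn.net/article/list/
Right-click the page and choose Inspect. In the DevTools panel, open the three-dot menu at the top right and set Dock side to the bottom, which moves the panel below the page.
If Web Scraper is installed correctly, a Web Scraper tab now appears at the far right of the top row of DevTools tabs, as shown in the figure.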
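Open that tab. Of the three options at the top, click the third one, Create new sitemap. The Sitemap name can be anything you like, and Start URL is the page you want to scrape, for example https://blogdev.blog.csdn.net/article/list/. One more point: to scrape a specific range of pages, write the Start URL as https://blogdev.blog.csdn.net/article/list/[1-5], and the pages numbered 1 through 5 will be scraped.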
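To make the `[1-5]` notation concrete, here is a minimal Python sketch of what the range stands for. The helper name `expand_start_url` is made up for illustration, and this is not the extension's own code; it only handles the plain `[start-end]` form used above.

```python
import re

# Matches a numeric range placeholder such as [1-5].
RANGE_PATTERN = re.compile(r"\[(\d+)-(\d+)\]")


def expand_start_url(url):
    """Expand the first [start-end] placeholder in url into concrete URLs.

    Hypothetical helper for illustration only; it mimics the idea of the
    range notation, not the extension's implementation.
    """
    match = RANGE_PATTERN.search(url)
    if match is None:
        return [url]  # no range placeholder: a single start URL
    start, end = int(match.group(1)), int(match.group(2))
    prefix, suffix = url[:match.start()], url[match.end():]
    return [f"{prefix}{n}{suffix}" for n in range(start, end + 1)]


urls = expand_start_url("https://blogdev.blog.csdn.net/article/list/[1-5]")
for u in urls:
    print(u)
```

Each printed URL is one listing page that the scrape would start from.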
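Next, click Add new selector to add a selector. This one acts as the top-level container; its settings are shown in the figure.
Id can be anything you like, and Type must be Element. Then click Select and click the matching blocks on the page. Here each article, together with its related information, is treated as one block, as shown in the figure.
Finally, make sure Multiple is checked for this Element selector, then click Save.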
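Now extract the details from each block. Under the Element selector, create several sibling selectors for the title, content, time, read count, comment count, and so on. Note that their type is Text (SelectorText in the exported sitemap), Multiple must not be checked, and Parent selectors must be set to the Element selector created earlier, as shown in the figure.
At this point you can click Selector graph under the Sitemap csdn menu to inspect the structure, as shown in the figure.
In other words: one Element selector with Multiple enabled, followed by several non-Multiple selectors of type Text (or similar).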
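The same parent/child layout can be printed from the selector definitions themselves. The sketch below uses an abridged copy of the csdn selectors from this post (only `total`, `title`, and `time`); the function name `selector_tree` is my own, and it assumes the only cycles are selectors that list themselves as a parent.

```python
from collections import defaultdict

# Abridged from the csdn sitemap in this post: one Element selector, two Text children.
selectors = [
    {"id": "title", "parentSelectors": ["total"], "type": "SelectorText", "multiple": False},
    {"id": "time", "parentSelectors": ["total"], "type": "SelectorText", "multiple": False},
    {"id": "total", "parentSelectors": ["_root"], "type": "SelectorElement", "multiple": True},
]


def selector_tree(selectors, root="_root"):
    """Return indented lines describing the parent/child layout of the selectors."""
    children = defaultdict(list)
    for sel in selectors:
        for parent in sel["parentSelectors"]:
            if parent != sel["id"]:  # skip self references (used by pagination selectors)
                children[parent].append(sel)

    lines = []

    def walk(parent, depth):
        for sel in children[parent]:
            flag = "multiple" if sel.get("multiple") else "single"
            lines.append("  " * depth + f'{sel["id"]} ({sel["type"]}, {flag})')
            walk(sel["id"], depth + 1)

    walk(root, 0)
    return lines


for line in selector_tree(selectors):
    print(line)
```

The output lists `total` at the top level with `title` and `time` indented beneath it, which is the shape the Selector graph view draws.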
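Then click Scrape under the Sitemap csdn menu, leave the parameters at their defaults, and click Start scraping. When the scrape finishes, the page does not show any data yet; click refresh data to display it. After that, click Export data under the Sitemap csdn menu, choose a file format, and download the file. The figure shows the scraped result.
If you are still unsure how to write this sitemap, you can use the one below: click Import Sitemap under Create new sitemap and paste it in.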
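Once the file is downloaded, it can be read with Python's standard `csv` module. The sketch below is only an illustration: the sample text is made-up placeholder data, not real scrape output, and I am assuming the export contains bookkeeping columns whose names start with `web-scraper-` next to one column per selector id. Check the header row of your own file before relying on that.

```python
import csv
import io

# Made-up stand-in for an exported file; a real export holds your own scraped rows.
SAMPLE_EXPORT = """web-scraper-order,web-scraper-start-url,title,time
1-1,https://example.com/list/1,Placeholder title A,2020-01-01
1-2,https://example.com/list/1,Placeholder title B,2020-01-02
"""


def load_rows(csv_text, drop_prefix="web-scraper-"):
    """Parse exported CSV text and drop columns whose name starts with drop_prefix.

    The prefix is an assumption about the export layout, not a documented guarantee.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {key: value for key, value in row.items() if not key.startswith(drop_prefix)}
        for row in reader
    ]


rows = load_rows(SAMPLE_EXPORT)
print(rows[0]["title"])
```

For a file on disk, read it first, for example with `open(path, newline="", encoding="utf-8-sig").read()`, and pass the text to `load_rows`; `path` is whatever name you saved the export under.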
{"_id":"csdn","startUrl":["https://blogdev.blog.csdn.net/article/list/[1-5]"],"selectors":[{"id":"title","parentSelectors":["total"],"type":"SelectorText","selector":"h4 a","multiple":false,"regex":""},{"id":"content","parentSelectors":["total"],"type":"SelectorText","selector":"p.content","multiple":false,"regex":""},{"id":"time","parentSelectors":["total"],"type":"SelectorText","selector":"span.date","multiple":false,"regex":""},{"id":"read","parentSelectors":["total"],"type":"SelectorText","selector":"span.read-num:nth-of-type(2)","multiple":false,"regex":""},{"id":"comments","parentSelectors":["total"],"type":"SelectorText","selector":"span.read-num:nth-of-type(3)","multiple":false,"regex":""},{"id":"total","parentSelectors":["_root"],"type":"SelectorElement","selector":"div.article-item-box","multiple":true}]}
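Before importing a hand-edited sitemap, a quick sanity check can catch the mistakes this post warns about. The following is a rough sketch, not part of Web Scraper: `check_sitemap` is a hypothetical helper, the JSON is an abridged copy of the sitemap above, and the rules simply restate the advice given here (an Element selector with Multiple checked, Text selectors without it, parents that actually exist).

```python
import json

# Abridged copy of the csdn sitemap from this post (two of its six selectors).
SITEMAP_JSON = """
{"_id": "csdn",
 "startUrl": ["https://blogdev.blog.csdn.net/article/list/[1-5]"],
 "selectors": [
  {"id": "title", "parentSelectors": ["total"], "type": "SelectorText",
   "selector": "h4 a", "multiple": false, "regex": ""},
  {"id": "total", "parentSelectors": ["_root"], "type": "SelectorElement",
   "selector": "div.article-item-box", "multiple": true}
 ]}
"""


def check_sitemap(sitemap):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    selectors = sitemap["selectors"]
    known_ids = {sel["id"] for sel in selectors}
    for sel in selectors:
        for parent in sel["parentSelectors"]:
            if parent != "_root" and parent not in known_ids:
                problems.append(f'{sel["id"]}: unknown parent {parent}')
        if sel["type"] == "SelectorText" and sel.get("multiple"):
            problems.append(f'{sel["id"]}: Text selector should not be Multiple')
    if not any(s["type"] == "SelectorElement" and s.get("multiple") for s in selectors):
        problems.append("no Element selector with Multiple enabled")
    return problems


sitemap = json.loads(SITEMAP_JSON)
print(check_sitemap(sitemap))
```

An empty list means the sitemap follows the layout described above; anything else names the selector to fix.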
Another example is a car search site, as shown in the figure:
Its URL is: https://car.autohome.com.cn/searchcar
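Create a sitemap as before. Because the URL does not visibly change from page to page here, add a new selector named page, set its Type to Pagination, and check Multiple, as shown in the figure:
Then add an Element selector as before, but note that besides checking Multiple you must also set Parent Selectors to page, as shown in the figure:
After that the steps are the same as before: under the cars Element selector, create a few Text selectors, leaving Multiple unchecked and setting Parent to cars, as shown in the figure.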
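View the selector graph.
Then run the scrape. This time every page is scraped, so it takes a little longer. Finally, use Export data to download the file, as shown in the figure.
Done.
The sitemap for the car scrape is included here as well: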
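Why does one pagination selector reach every page? Conceptually, the pager links found on each page lead to further pages, and each page is visited once. The sketch below simulates that idea on a made-up link table; `PAGER_LINKS` and `visit_all_pages` are hypothetical, and this is not how the extension is implemented internally.

```python
from collections import deque

# Hypothetical pager: for each page number, the page numbers its pager links to.
PAGER_LINKS = {1: [2, 3], 2: [1, 3, 4], 3: [2, 4], 4: [3]}


def visit_all_pages(start, links):
    """Breadth-first walk over pager links, visiting each page exactly once."""
    seen = {start}
    order = [start]
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in links.get(page, []):
            if nxt not in seen:  # skip pages already queued or visited
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


print(visit_all_pages(1, PAGER_LINKS))
```

Page 4 is not linked from page 1, yet it is still reached through page 2, which is why the scrape takes longer as the number of result pages grows.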
{"_id":"car","startUrl":["https://car.autohome.com.cn/searchcar"],"selectors":[{"id":"page","parentSelectors":["_root","page"],"paginationType":"auto","selector":".findcar-page__num a","type":"SelectorPagination"},{"id":"cars","parentSelectors":["page"],"type":"SelectorElement","selector":"div.content","multiple":true},{"id":"name","parentSelectors":["cars"],"type":"SelectorText","selector":"h3","multiple":false,"regex":""},{"id":"price","parentSelectors":["cars"],"type":"SelectorText","selector":"span.price","multiple":false,"regex":""},{"id":"type","parentSelectors":["cars"],"type":"SelectorText","selector":".info dd","multiple":false,"regex":""}]}
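One detail worth noticing in this sitemap is that `page` lists both `_root` and itself under `parentSelectors`. As far as I understand it, that is what lets the pagination selector run on the start page and again on every page it discovers. The short sketch below, using an abridged copy of the sitemap and two helper names of my own, pulls out that self reference and the chain beneath it.

```python
import json

# Abridged copy of the car sitemap from this post (the "type" selector is omitted).
CAR_SITEMAP = json.loads("""
{"_id": "car",
 "startUrl": ["https://car.autohome.com.cn/searchcar"],
 "selectors": [
  {"id": "page", "parentSelectors": ["_root", "page"], "paginationType": "auto",
   "selector": ".findcar-page__num a", "type": "SelectorPagination"},
  {"id": "cars", "parentSelectors": ["page"], "type": "SelectorElement",
   "selector": "div.content", "multiple": true},
  {"id": "name", "parentSelectors": ["cars"], "type": "SelectorText",
   "selector": "h3", "multiple": false, "regex": ""},
  {"id": "price", "parentSelectors": ["cars"], "type": "SelectorText",
   "selector": "span.price", "multiple": false, "regex": ""}
 ]}
""")


def self_parenting(sitemap):
    """Ids of selectors that list themselves as one of their own parents."""
    return [s["id"] for s in sitemap["selectors"] if s["id"] in s["parentSelectors"]]


def children_of(sitemap, parent_id):
    """Ids of selectors directly under parent_id, ignoring self references."""
    return [
        s["id"]
        for s in sitemap["selectors"]
        if parent_id in s["parentSelectors"] and s["id"] != parent_id
    ]


print(self_parenting(CAR_SITEMAP))
print(children_of(CAR_SITEMAP, "page"))
print(children_of(CAR_SITEMAP, "cars"))
```

Read top to bottom, the three results trace the chain `_root` → `page` → `cars` → Text selectors described in the steps above.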
Jarvis 230314