For relatively small scraping jobs I just use the requests library plus bs4, and when things get a bit trickier I bring in Selenium to handle asynchronously loaded JavaScript. I only reach for a framework on comparatively large projects, mainly because it makes the code easier to manage and extend.
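A minimal sketch of that requests + bs4 workflow looks roughly like this (the URL and the tags pulled out are just placeholders for illustration):

import requests
from bs4 import BeautifulSoup

# Fetch the page (example.com is only a placeholder URL).
resp = requests.get("http://example.com", timeout=10)
resp.raise_for_status()

# Parse the HTML and print the text and href of every link.
soup = BeautifulSoup(resp.text, "html.parser")
for a in soup.find_all("a", href=True):
    print(a.get_text(strip=True), a["href"])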
Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs, from data mining and information processing to archiving historical data.
It was originally designed for page scraping (more precisely, web scraping), but it can also be used to fetch data returned by APIs (for example Amazon Associates Web Services) or as a general-purpose web crawler.
Installation:
pip install scrapy
scrapy startproject tutorial

ls
tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
import scrapy


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Save each downloaded page to a local file named after the last path segment.
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
scrapy crawl dmoz
This is just a quick introduction; when I have time I will write some more detailed articles about Scrapy, since most of my crawler data is collected on top of it.
Project site: https://scrapy.org/
PySpider: a powerful web crawler system written by a Chinese developer, with a powerful WebUI. It is written in Python, has a distributed architecture, supports multiple database backends, and its WebUI includes a script editor, task monitor, project manager, and result viewer.
Just open the web interface, type your code into the editor, and you are done.
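Getting the WebUI up on a single machine is roughly this (a minimal sketch assuming a default install; 5000 is pyspider's default WebUI port):

pip install pyspider
pyspider          # starts all components; then open http://localhost:5000/ in a browser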
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
Crawley can crawl a site's content at high speed, supports both relational and non-relational databases, and can export data as JSON, XML, and so on.
~$ crawley startproject [project_name]
~$ cd [project_name]

""" models.py """

from crawley.persistance import Entity, UrlEntity, Field, Unicode


class Package(Entity):

    # add your table fields here
    updated = Field(Unicode(255))
    package = Field(Unicode(255))
    description = Field(Unicode(255))

""" crawlers.py """

from crawley.crawlers import BaseCrawler
from crawley.scrapers import BaseScraper
from crawley.extractors import XPathExtractor
from models import *


class pypiScraper(BaseScraper):

    # specify the urls that can be scraped by this class
    matching_urls = ["%"]

    def scrape(self, response):

        # getting the current document's url
        current_url = response.url
        # getting the html table
        table = response.html.xpath("/html/body/div[5]/div/div/div[3]/table")[0]

        # for rows 1 to n-1
        for tr in table[1:-1]:

            # obtaining the searched html inside the rows
            td_updated = tr[0]
            td_package = tr[1]
            package_link = td_package[0]
            td_description = tr[2]

            # storing data in the Packages table
            Package(updated=td_updated.text, package=package_link.text, description=td_description.text)


class pypiCrawler(BaseCrawler):

    # add your starting urls here
    start_urls = ["http://pypi.python.org/pypi"]

    # add your scraper classes here
    scrapers = [pypiScraper]

    # specify your maximum crawling depth level
    max_depth = 0

    # select your favourite HTML parsing tool
    extractor = XPathExtractor

""" settings.py """

import os
PATH = os.path.dirname(os.path.abspath(__file__))

# Don't change this if you haven't renamed the project
PROJECT_NAME = "pypi"
PROJECT_ROOT = os.path.join(PATH, PROJECT_NAME)

DATABASE_ENGINE = 'sqlite'
DATABASE_NAME = 'pypi'
DATABASE_USER = ''
DATABASE_PASSWORD = ''
DATABASE_HOST = ''
DATABASE_PORT = ''

SHOW_DEBUG_INFO = True
~$ crawley run
Portia is an open-source visual scraping tool that lets you crawl websites without any programming knowledge! Simply annotate the pages you are interested in and Portia will create a spider that extracts data from similar pages.
It is extremely simple to use; take a look at the docs: http://portia.readthedocs.io/en/latest/index.html
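For reference, the documented way to run it locally is through Docker, roughly like this (a sketch; the host projects folder is just an example path, and the exact flags may differ between Portia versions):

docker run -v ~/portia_projects:/app/data/projects:rw -p 9001:9001 scrapinghub/portia

Then open http://localhost:9001 in a browser and start annotating pages.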
Newspaper can be used to extract news and articles and to do content analysis. It is multithreaded, supports more than ten languages, and returns everything as Unicode. The author, inspired by the simplicity and power of the requests library, wrote it in Python for extracting article content.
>>> from newspaper import Article

>>> url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
>>> article = Article(url)
>>> article.download()

>>> article.html
'<!DOCTYPE HTML><html itemscope itemtype="http://...'

>>> article.parse()

>>> article.authors
['Leigh Ann Caldwell', 'John Honway']

>>> article.publish_date
datetime.datetime(2013, 12, 30, 0, 0)

>>> article.text
"Washington (CNN) -- Not everyone subscribes to a New Year's resolution..."

>>> article.top_image
'http://someCDN.com/blah/blah/blah/file.png'

>>> article.movies
['http://youtube.com/path/to/link.com', ...]

>>> article.nlp()

>>> article.keywords
['New Years', 'resolution', ...]

>>> article.summary
'The study shows that 93% of people ...'
Beautiful Soup is a Python library for extracting data from HTML and XML files. It gives you idiomatic ways to navigate, search, and modify a document with the parser of your choice, and it can save you hours or even days of work. I use it extremely frequently; all of my HTML element extraction is done with bs4.
# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from six.moves import urllib

DOMAIN = 'http://flagpedia.asia'


class FlagSpider(scrapy.Spider):
    name = 'flag'
    allowed_domains = ['flagpedia.asia', 'flags.fmcdn.net']
    start_urls = ['http://flagpedia.asia/index']

    def parse(self, response):
        # Parse the index page with bs4 and follow the link inside each flag cell.
        html_doc = response.body
        soup = BeautifulSoup(html_doc, 'html.parser')

        a = soup.findAll('td', class_="td-flag")
        for i in a:
            url = i.a.attrs.get("href")
            full_url = urljoin(DOMAIN, url)
            yield scrapy.Request(full_url, callback=self.parse_news)

    def parse_news(self, response):
        # On the detail page, take the 2x flag image URL from srcset and download it.
        html_doc = response.body
        soup = BeautifulSoup(html_doc, 'html.parser')
        p = soup.find("p", id="flag-detail")
        img_url = p.img.attrs.get("srcset").split(" 2x")[0]
        url = "http:" + img_url
        img_name = img_url.split("/")[-1]

        urllib.request.urlretrieve(url, "/Users/youdi/Project/python/Rino_nakasone_backend/RinoNakasone/flag/{}".format(img_name))
        print(url)
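The spider above only uses bs4 for searching; as a minimal standalone sketch of the navigation and modification side of the API mentioned earlier (the HTML snippet is made up for illustration):

from bs4 import BeautifulSoup

html = "<html><body><p class='intro'>Hello <a href='/x'>world</a></p></body></html>"
soup = BeautifulSoup(html, 'html.parser')

# Navigation: walk from the <a> tag up to its parent paragraph.
link = soup.a
print(link['href'], link.parent.name)              # -> /x p

# Searching: find tags by name and attributes.
print(soup.find('p', class_='intro').get_text())   # -> Hello world

# Modification: change text and attributes in place, then re-serialize.
link.string = 'there'
link['href'] = '/y'
print(soup.prettify())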
Grab is a Python framework for building web scrapers. With Grab you can build scraping tools of any complexity, from simple five-line scripts to complex asynchronous crawlers that process millions of pages. Grab provides an API for performing network requests and handling the received content, for example interacting with the DOM tree of an HTML document.
Project site: http://docs.grablib.org/en/latest/#grab-spider-user-manual
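A minimal sketch of that request-then-DOM workflow with Grab's classic API (the URL is a placeholder, and exact method names can differ between Grab versions, so treat this as an assumption rather than a definitive example):

from grab import Grab

g = Grab()
# Perform a GET request (example.com is only a placeholder URL).
g.go('http://example.com')

# Query the parsed DOM tree of the response with an XPath expression.
print(g.doc.select('//title').text())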
Cola is a distributed crawler framework. The user only needs to write a few specific functions and does not need to worry about the details of distributed execution: tasks are automatically distributed across multiple machines, and the whole process is transparent to the user.
Selenium is an automated testing tool. It supports all the major GUI browsers, including Chrome, Safari, and Firefox; with the Selenium driver for one of these browsers installed, you can easily test a web UI. Selenium drives the browser through browser drivers and can be used from several languages, such as Java, C#, and Ruby. In a typical scraping setup, PhantomJS renders and parses the JavaScript, Selenium drives it and bridges to Python, and Python does the post-processing.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

browser = webdriver.Firefox()

browser.get('http://www.yahoo.com')
assert 'Yahoo' in browser.title

elem = browser.find_element_by_name('p')  # Find the search box
elem.send_keys('seleniumhq' + Keys.RETURN)

browser.quit()
The information the Python-goose framework can extract includes the main body text of an article, the article's main image, any YouTube/Vimeo videos embedded in the article, the meta description, and the meta tags.
>>> from goose import Goose
>>> url = 'http://edition.cnn.com/2012/02/22/world/europe/uk-occupy-london/index.html?hpt=ieu_c2'
>>> g = Goose()
>>> article = g.extract(url=url)
>>> article.title
u'Occupy London loses eviction fight'
>>> article.meta_description
"Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoid eviction Wednesday in a decision made by London's Court of Appeal."
>>> article.cleaned_text[:150]
(CNN) -- Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoi
>>> article.top_image.src
http://i2.cdn.turner.com/cnn/dam/assets/111017024308-occupy-london-st-paul-s-cathedral-story-top.jpg