Given the nature of my job, I work on a new project every week, each one solving a different problem. My work requires me to parse many different kinds of datasets in order to design and develop instructional material for data science aspirants.

This post lists a few useful datasets and data repositories, grouped by class of problem and by industry.


Data Repositories on the Web

  • Google Dataset Search — a search engine for researchers to locate online data.

  • datasetlist — offers a list of the biggest machine learning datasets from across the web.

  • UCI: one of the oldest repositories, with datasets classified by problem type, attribute type, data type, field of study, and so on.

  • fastai-datasets: datasets for image classification, NLP, and image localization.

  • NLP-datasets: an alphabetical list of free/public-domain datasets with text data for use in natural language processing.

  • Bifrost — for visual datasets classified by task, application, class, label, and format.

Image Datasets

  • ImageNet: an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images.


  • CT Medical Images: designed to let different methods be tested for examining trends in CT image data associated with contrast use and patient age. The data is a small subset of images from The Cancer Imaging Archive.

  • Flickr-faces — Flickr-Faces-HQ (FFHQ) is a high-quality image dataset of human faces, originally created as a benchmark for generative adversarial networks (GAN).

  • objectnet — A new kind of vision dataset borrowing the idea of controls from other areas of science.

  • CelebFaces: the Large-scale CelebFaces Attributes (CelebA) dataset.

  • Animal Faces-HQ dataset (AFHQ) — a dataset of animal faces, consisting of 15,000 high-quality images at 512×512 resolution.

NLP Datasets

https://medium.com/@ODSC/20-open-datasets-for-natural-language-processing-538fbfaf8e38
  • nlp-datasets — Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP).


  • 1 trillion n-grams: from the Linguistic Data Consortium. This data is expected to be useful for statistical language modeling (for example, machine translation or speech recognition), as well as for other uses.

  • litbank — LitBank is an annotated dataset of 100 works of English-language fiction to support tasks in natural language processing and the computational humanities.

  • BookCorpus: scripts for reproducing the BookCorpus dataset yourself.


  • rasa-nlu-training-data — Crowd-sourced training data for the development and testing of Rasa NLU models.

  • Google Books Ngram: an online search engine that charts the frequencies of any set of search strings, using a yearly count of n-grams found in sources printed between 1500 and 2019 in Google’s text corpora in English, Chinese, French, German, Hebrew, Italian, Russian, or Spanish.


Sentiment Analysis

Audio

  • Audioset: a large-scale dataset that consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos.


Finance and Economy

Healthcare

  • Kaggle Healthcare repository: AI in healthcare is an area of growing interest. One of the major problems is simply converting research into an application. Should be easy, right?

  • WHO: global health datasets.

  • CDC: use this for US-specific public health data.

  • data.gov: US-focused healthcare data searchable by several different factors.

Scientific Research

  • Re3Data: with over 2,000 research data repositories, re3data has become the most comprehensive source of reference for research data infrastructures globally.

  • ELVIRA Biomedical Data Repository: High-dimensional datasets in the biomedical field. It focuses on journal-published data (Nature, Science, and others).

  • Merck Molecular Health Activity Challenge: datasets designed to foster machine learning for drug discovery by simulating how molecule combinations could interact with each other.

  • SEER — Datasets arranged by demographic groups and provided by the US government. You can search based on age, race, and gender.

  • CT Cancer Medical Images: designed to let different methods be tested for examining trends in CT image data associated with contrast use and patient age. The data is a small subset of images from The Cancer Imaging Archive.

Aerospace and Defense

  • NASA’s Data Portal — A continually growing catalog of publicly available NASA Datasets, APIs, visualizations, and more. Includes space science, aerospace, earth sciences, applied science, and management data.

  • Airline Data Project: commercial airline datasets from the MIT Global Airline Industry Program.

  • Astronomical Data Services: a variety of astronomical data available from the United States Naval Observatory (USNO), including data on the sun, moon, planets, and other celestial objects.

  • Astronomical Phenomena section of the Astronomical Almanac: various phenomena of astronomical interest, including solar, lunar, geocentric, and heliocentric phenomena. Tables of sunrise, sunset, and twilight are available, as well as data for solar and lunar eclipses.

  • NASA’s Asteroid Data Sets — Provides access to PDS data on asteroids, dust, planetary satellites, meteorites, and more.

E-Commerce

Python Libraries that Offer Datasets

https://blog.tensorflow.org/2019/02/introducing-tensorflow-datasets.html
  • TensorFlow Datasets: a collection of ready-to-use datasets for TensorFlow or other Python ML frameworks, such as Jax. All datasets are exposed as tf.data.Datasets, enabling easy-to-use and high-performance input pipelines. To get started, see the TensorFlow Datasets guide and its list of datasets.

  • sklearn (scikit-learn): machine learning package. It also features helpers to fetch larger datasets commonly used by the machine learning community to benchmark algorithms on data that comes from the ‘real world’.

  • nltk: Natural Language Toolkit package. Practical work in natural language processing typically uses large bodies of linguistic data, or corpora.

  • statsmodels: statistical modeling package. Provides datasets (i.e. data and metadata) for use in examples, tutorials, model testing, etc.

  • pydataset: datasets mainly for educational purposes. It aims to help those approaching data science in Python for the first time, who must deal with common (and time-consuming) data preparation tasks.

  • seaborn: data visualisation package that can also load example datasets from its online repository (requires internet).
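
As a rough illustration of the libraries above, the sketch below loads one of the small datasets bundled with scikit-learn, so nothing has to be downloaded. It assumes scikit-learn and pandas are installed; the helper name `iris_as_frame` is made up for this post and is not part of any library.

```python
from sklearn.datasets import load_iris


def iris_as_frame():
    """Return the bundled Iris dataset as a pandas DataFrame.

    Hypothetical helper for this post; only load_iris comes from scikit-learn.
    """
    bunch = load_iris(as_frame=True)  # as_frame=True requires pandas
    return bunch.frame  # the feature columns plus a "target" column


df = iris_as_frame()
print(df.shape)   # number of rows and columns
print(df.head())  # first few rows, to check what was loaded
```

The other libraries follow a similar pattern, for example `seaborn.load_dataset("tips")` (which needs an internet connection) or `tensorflow_datasets.load("mnist")`; check each library's documentation for the dataset names it currently offers.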

If there is any other important and authentic dataset or category you’d want me to add to this list, feel free to respond to this story!


