Given the nature of my job, I work on a new project every week, each one solving a different problem. My work requires me to go through many different kinds of datasets in order to design and develop instructional material for Data Science aspirants.
The blog contains a few useful datasets and data repositories categorized in different classes of problems and industries.
Data Repositories on the web:
Google Dataset Search — a search engine for researchers to locate online data.
datasetlist — offers a list of the biggest machine learning datasets from across the web.
UCI — one of the oldest repositories, with data classified by problem type, attribute type, data type, field of study, etc.
fastai-datasets — datasets for Image classification, NLP and Image localization
NLP-datasets — Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing
Bifrost — for visual datasets classified by task, application, class, label, and format.
Image Datasets
ImageNet — ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images.
CT Medical Images — designed to allow for different methods to be tested for examining the trends in CT image data associated with using contrast and patient age. The data consists of a tiny subset of images from the cancer imaging archive.
Flickr-faces — Flickr-Faces-HQ (FFHQ) is a high-quality image dataset of human faces, originally created as a benchmark for generative adversarial networks (GAN).
objectnet — A new kind of vision dataset borrowing the idea of controls from other areas of science.
CelebFaces — Large-scale CelebFaces attributes
Animal Faces-HQ dataset (AFHQ) — a dataset of animal faces, consisting of 15,000 high-quality images at 512×512 resolution.
NLP Datasets
nlp-datasets — Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP).
1 trillion n-grams — linguistic data consortium. This data is expected to be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses.
litbank — LitBank is an annotated dataset of 100 works of English-language fiction to support tasks in natural language processing and the computational humanities.
BookCorpus — these are scripts to reproduce BookCorpus by yourself.
rasa-nlu-training-data — Crowd-sourced training data for the development and testing of Rasa NLU models.
Google Books Ngram — an online search engine that charts the frequencies of any set of search strings using a yearly count of n-grams found in sources printed between 1500 and 2019 in Google’s text corpora in English, Chinese, French, German, Hebrew, Italian, Russian, or Spanish.
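Several entries in this list are built around n-gram counts, so here is a minimal sketch of what a "yearly count of n-grams" means. The toy corpus and the names `ngrams`, `corpus_by_year`, and `yearly_counts` are made up for illustration and are not taken from any of the datasets above.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical toy corpus keyed by publication year (illustration only).
corpus_by_year = {
    1900: "the cat sat on the mat",
    1901: "the cat saw the cat",
}

# Map each year to a Counter of its bigrams (n = 2).
yearly_counts = {
    year: Counter(ngrams(text.lower().split(), 2))
    for year, text in corpus_by_year.items()
}

# How often does the bigram "the cat" occur in each year?
for year in sorted(yearly_counts):
    print(year, yearly_counts[year][("the", "cat")])
```

Real n-gram datasets ship these counts precomputed over very large corpora; the sketch only shows the shape of the data you should expect to work with.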
Sentiment Analysis
Reviews — Amazon Reviews, Yelp Reviews, Movie Reviews, Food Reviews, Twitter Airline.
Stanford Sentiment Treebank — This Stanford dataset contains just over 10,000 pieces of data taken from the HTML files of Rotten Tomatoes.
Lexicoder Sentiment Dictionary — Lexicoder performs simple deductive content analyses of any kind of text, in almost any language.
Opinion Lexicon — A list of English positive and negative opinion words or sentiment words.
Conversational Datasets — A collection of large datasets for conversational response selection.
More — NRC-Emotion-Lexicon-Wordlevel, ISEAR(17K), HappyDB, emotion-to-emoji-mapping
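Word lists such as the Opinion Lexicon are typically used for lexicon-based scoring. Below is a minimal sketch of that idea, assuming a tiny made-up lexicon: `POSITIVE`, `NEGATIVE`, and `lexicon_score` are hypothetical names, and a real lexicon has thousands of entries.

```python
import re

# Hypothetical miniature lexicon (illustration only, not the real word lists).
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "terrible", "awful", "hate"}


def lexicon_score(text):
    """Return (# positive words) - (# negative words) found in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)


print(lexicon_score("I love this excellent movie"))
print(lexicon_score("Awful, just awful"))
```

A score above zero leans positive and a score below zero leans negative. This simple count ignores negation and context, which is one reason the labeled review datasets above are useful for training stronger models.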
Audio
Audioset — a large scale dataset that consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos.
Finance and Economy
Kaggle Finance datasets — The finance datasets are about money and investing. If you need to test some new cryptocurrency investment strategies or ward off those pesky credit card fraud enthusiasts, then you’ve come to the right place.
CFPB Credit Card History — The number and aggregate credit limit of new credit cards opened each month.
Top Banks — This dataset contains a list of the world’s largest banks.
Student Loan Debt — A collection of student loan debt summary data, including debt balance by age, amount, and debt types.
International Monetary Fund, Financial Times Dataset, World Open Bank Data
Healthcare
Kaggle Healthcare repository — AI in healthcare is a growing interest. One of the major problems is simply converting research into an application. Should be easy, right?
WHO: global health datasets.
CDC: Use this for US-specific public health.
data.gov: US-focused healthcare data searchable by several different factors.
Scientific Research
Re3Data: With over 2,000 research data repositories, re3data has become the most comprehensive source of reference for research data infrastructures globally.
ELVIRA Biomedical Data Repository: High-dimensional datasets in the biomedical field. It focuses on journal-published data (Nature, Science, and others).
Merck Molecular Health Activity Challenge: Datasets designed to foster the machine learning pursuit of drug discovery by simulating how molecule combinations could interact with each other.
SEER — Datasets arranged by demographic groups and provided by the US government. You can search based on age, race, and gender.
CT Cancer Medical Images — designed to allow for different methods to be tested for examining the trends in CT image data associated with using contrast and patient age. The data are a tiny subset of images from the cancer imaging archive.
Aerospace and Defense
NASA’s Data Portal — A continually growing catalog of publicly available NASA Datasets, APIs, visualizations, and more. Includes space science, aerospace, earth sciences, applied science, and management data.
Airline Data Project — Commercial airline data sets from the MIT Global Airline Industry Program
Astronomical Data Services — A variety of astronomical data available from the United States Naval Observatory (USNO). The data covers the sun, moon, planets, and other celestial objects.
Astronomical Phenomena section of the Astronomical Almanac — Various phenomena of astronomical interest, including solar, lunar, geocentric, and heliocentric phenomena. Tables of sunrise, sunset, and twilight are available, as well as data for solar and lunar eclipses.
NASA’s Asteroid Data Sets — Provides access to PDS data on asteroids, dust, planetary satellites, meteorites, and more.
E-Commerce
Python Libraries that offer Datasets
TensorFlow Datasets — a collection of ready-to-use datasets for TensorFlow or other Python ML frameworks, such as Jax. All datasets are exposed as tf.data.Datasets, enabling easy-to-use and high-performance input pipelines. To get started, see the guide and the list of datasets.
sklearn — Machine Learning package. This package also features helpers to fetch larger datasets commonly used by the machine learning community to benchmark algorithms on data that comes from the ‘real world’.
nltk: Natural Language Toolkit package. Practical work in Natural Language Processing typically uses large bodies of linguistic data, or corpora.
statsmodels: Statistical modeling package. Provides data sets (i.e. data and meta-data) for use in examples, tutorials, model testing, etc.
pydataset — Dataset for educational purposes, mainly. It tries to help those approaching Data Science in Python for the first time, who must deal with common (and time-consuming) data preparation tasks.
seaborn: Data Visualisation package where you can also load an example dataset from the online repository (requires internet).
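As a quick illustration of loading a dataset straight from one of these libraries, here is a minimal sketch using scikit-learn's bundled iris data. It assumes scikit-learn is installed. The `load_*` functions read small datasets shipped with the package, while the `fetch_*` helpers mentioned above download larger ones on first use.

```python
from sklearn.datasets import load_iris

# load_iris reads a small dataset bundled with scikit-learn (no download needed).
iris = load_iris()

print(iris.data.shape)          # feature matrix: 150 samples x 4 features
print(iris.feature_names)       # column names for the feature matrix
print(list(iris.target_names))  # class labels for the target vector
```

The other libraries in this list follow a similar pattern of one call per dataset, though some of them (seaborn, for example) need an internet connection to fetch the data.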
If there is any other important and authentic dataset or category you’d want me to add to this list, feel free to respond to this story!