[Machine Learning] Dataset Roundup

Face recognition

In computer vision, face images are used extensively to develop face recognition and face detection systems, along with many other projects that rely on images of faces.

| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Face Recognition Technology (FERET) | 11,338 images of 1,199 individuals in different positions and at different times. | None. | 11,338 | Images | Classification, face recognition | 2003 | [6][7] | United States Department of Defense |
| CMU Pose, Illumination, and Expression (PIE) | 41,368 color images of 68 people in 13 different poses. | Images labeled with expressions. | 41,368 | Images, text | Classification, face recognition | 2000 | [8][9] | R. Gross et al. |
| SCFace | Color images of faces at various angles. | Location of facial features extracted. Coordinates of features given. | 4,160 | Images, text | Classification, face recognition | 2011 | [10][11] | M. Grgic et al. |
| YouTube Faces DB | Videos of 1,595 different people gathered from YouTube. Each clip is between 48 and 6,070 frames. | Identity of those appearing in videos and descriptors. | 3,425 videos | Video, text | Video classification, face recognition | 2011 | [12][13] | L. Wolf et al. |
| 300 videos in-the-Wild | 114 videos annotated for facial landmark tracking. The 68 landmark mark-up is applied to every frame. | None. | 114 videos, 218,000 frames | Video, annotation file | Facial landmark tracking | 2015 | [14] | Shen, Jie et al. |
| Grammatical Facial Expressions Dataset | Grammatical facial expressions from Brazilian Sign Language. | Microsoft Kinect features extracted. | 27,965 | Text | Facial gesture recognition | 2014 | [15] | F. Freitas et al. |
| CMU Face Images Dataset | Images of faces. Each person is photographed multiple times to capture different expressions. | Labels and features. | 640 | Images, text | Face recognition | 1999 | [16][17] | T. Mitchell |
| Yale Face Database | Faces of 15 individuals in 11 different expressions. | Labels of expressions. | 165 | Images | Face recognition | 1997 | [18][19] | J. Yang et al. |
| Cohn-Kanade AU-Coded Expression Database | Large database of images with labels for expressions. | Tracking of certain facial features. | 500+ sequences | Images, text | Facial expression analysis | 2000 | [20][21] | T. Kanade et al. |
| FaceScrub | Images of public figures scrubbed from image searching. | Name and m/f annotation. | 107,818 | Images, text | Face recognition | 2014 | [22][23] | H. Ng et al. |
| BioID Face Database | Images of faces with eye positions marked. | Manually set eye positions. | 1521 | Images, text | Face recognition | 2001 | [24][25] | BioID |
| Skin Segmentation Dataset | Randomly sampled color values from face images. | B, G, R values extracted. | 245,057 | Text | Segmentation, classification | 2012 | [26][27] | R. Bhatt |
| Bosphorus | 3D face image database. | 34 action units and 6 expressions labeled; 24 facial landmarks labeled. | 4652 | Images, text | Face recognition, classification | 2008 | [28][29] | A. Savran et al. |
| UOY 3D-Face | Neutral face, 5 expressions: anger, happiness, sadness, eyes closed, eyebrows raised. | Labeling. | 5250 | Images, text | Face recognition, classification | 2004 | [30][31] | University of York |
| CASIA | Expressions: anger, smile, laugh, surprise, closed eyes. | None. | 4624 | Images, text | Face recognition, classification | 2007 | [32][33] | Institute of Automation, Chinese Academy of Sciences |
| CASIA | Expressions: anger, disgust, fear, happiness, sadness, surprise. | None. | 480 | Annotated visible spectrum and near infrared video captures at 25 frames per second | Face recognition, classification | 2011 | [34] | Zhao, G. et al. |
| BU-3DFE | Neutral face, and 6 expressions: anger, happiness, sadness, surprise, disgust, fear (4 levels). 3D images extracted. | None. | 2500 | Images, text | Facial expression recognition, classification | 2006 | [35] | Binghamton University |
| Face Recognition Grand Challenge Dataset | Up to 22 samples for each subject. Expressions: anger, happiness, sadness, surprise, disgust, puffy. 3D data. | None. | 4007 | Images, text | Face recognition, classification | 2004 | [36][37] | National Institute of Standards and Technology |
| Gavabdb | Up to 61 samples for each subject. Expressions: neutral face, smile, frontal accentuated laugh, frontal random gesture. 3D images. | None. | 549 | Images, text | Face recognition, classification | 2008 | [38][39] | King Juan Carlos University |
| 3D-RMA | Up to 100 subjects, expressions mostly neutral. Several poses as well. | None. | 9971 | Images, text | Face recognition, classification | 2004 | [40][41] | Royal Military Academy (Belgium) |

Action recognition

| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Human Motion DataBase (HMDB51) | 51 action categories, each containing at least 101 clips, extracted from a range of sources. | None. | 6,766 video clips | Video clips | Action classification | 2011 | [42] | H. Kuehne et al. |
| TV Human Interaction Dataset | Videos from 20 different TV shows for predicting social actions: handshake, high five, hug, kiss and none. | None. | 300 video clips | Video clips | Action prediction | 2013 | [43] | Patron-Perez, A. et al. |
| UT Interaction | People acting out one of 6 actions (shake-hands, point, hug, push, kick, and punch), sometimes with multiple groups in the same video clip. | None. | 120 video clips | Video clips | Action prediction | 2009 | [44] | Ryoo, M. S. et al. |
| UT Kinect | 10 different people performing one of 10 actions (walk, sit down, stand up, pick up, carry, throw, push, pull, wave hands and clap hands) in an office setting. | None. | 200 video clips with depth information at 15 frames per second | Video clips with depth information | Action classification | 2012 | [45] | Xia, L. et al. |
| SBU Interact | Seven participants performing one of 8 actions together (approaching, departing, pushing, kicking, punching, exchanging objects, hugging, and shaking hands) in an office setting. | None. | Around 300 interactions | Video clips with depth information | Action classification | 2012 | [46] | Yun, K. et al. |
| Berkeley Multimodal Human Action Database (MHAD) | Recordings of a single person performing 12 actions. | MoCap pre-processing. | 660 action samples | 8 PhaseSpace Motion Capture, 2 stereo cameras, 4 quad cameras, 6 accelerometers, 4 microphones | Action classification | 2013 | [47] | Ofli, F. et al. |
| UCF 101 Dataset | Self-described as "a dataset of 101 human actions classes from videos in the wild." Dataset is large with over 27 hours of video. | Actions classified and labeled. | 13,000 | Video, images, text | Classification, action detection | 2012 | [48][49] | K. Soomro et al. |
| THUMOS Dataset | Large video dataset for action classification. | Actions classified and labeled. | 45M frames of video | Video, images, text | Classification, action detection | 2013 | [50][51] | Y. Jiang et al. |
| ActivityNet | Large video dataset for activity recognition and detection. | Actions classified and labeled. | 10,024 | Video, images, text | Classification, action detection | 2015 | [52] | Heilbron et al. |
| MSP-AVATAR | Improvised scenarios annotated for discourse functions: contrast, confirmation/negation, question, uncertainty, suggest, giving orders, warn, inform, size description, using pronouns. | Actions classified and labeled. | 74 sessions | Motion-captured video, audio | Classification, action detection | 2015 | [53] | Sadoughi, N. et al. |
| LILiR Twotalk Corpus | Video datasets for non-verbal communication activity recognition: agreement, thinking, asking and understanding. | Actions classified and labeled. | 527 | Video | Action detection | 2011 | [54] | Sheerman-Chase et al. |

Object detection & recognition

| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DAVIS: Densely Annotated VIdeo Segmentation | 150 video sequences containing 10,459 frames with a total of 376 objects annotated. | Dataset released for the 2017 DAVIS Challenge with a dedicated workshop co-located with CVPR 2017. The videos contain several types of objects and humans with a high quality segmentation annotation. | 10,459 | Frames annotated | Video object segmentation | 2017 | [55] | Pont-Tuset, J. et al. |
| T-LESS: An RGB-D Dataset for 6D Pose Estimation of Texture-less Objects | 30 industry-relevant objects. 39K training and 10K test images from each of three sensors. Two types of 3D models for each object. | 6D poses for all modeled objects in all images. Per-pixel labelling can be obtained by rendering of the object models at the ground truth poses. | 49,000 | RGB-D images, 3D object models | 6D object pose estimation, object detection | 2017 | [56] | T. Hodan et al. |
| Berkeley 3-D Object Dataset | 849 images taken in 75 different scenes. About 50 different object classes are labeled. | Object bounding boxes and labeling. | 849 | Labeled images, text | Object recognition | 2014 | [57][58] | A. Janoch et al. |
| Berkeley Segmentation Data Set and Benchmarks 500 (BSDS500) | 500 natural images, explicitly separated into disjoint train, validation and test subsets + benchmarking code. Based on BSDS300. | Each image segmented by five different subjects on average. | 500 | Segmented images | Contour detection and hierarchical image segmentation | 2011 | [59] | University of California, Berkeley |
| Microsoft Common Objects in Context (COCO) | Complex everyday scenes of common objects in their natural context. | Object highlighting, labeling, and classification into 91 object types. | 2,500,000 | Labeled images, text | Object recognition | 2015 | [60][61] | T. Lin et al. |
| SUN Database | Very large scene and object recognition database. | Places and objects are labeled. Objects are segmented. | 131,067 | Images, text | Object recognition, scene recognition | 2014 | [62][63] | J. Xiao et al. |
| ImageNet | Labeled object image database, used in the ImageNet Large Scale Visual Recognition Challenge. | Labeled objects, bounding boxes, descriptive words, SIFT features. | 14,197,122 | Images, text | Object recognition, scene recognition | 2014 | [64][65] | J. Deng et al. |
| TV News Channel Commercial Detection Dataset | TV commercials and news broadcasts. | Audio and video features extracted from still images. | 129,685 | Text | Clustering, classification | 2015 | [66][67] | P. Guha et al. |
| Statlog (Image Segmentation) Dataset | The instances were drawn randomly from a database of 7 outdoor images and hand-segmented to create a classification for every pixel. | Many features calculated. | 2310 | Text | Classification | 1990 | [68] | University of Massachusetts |
| Caltech 101 | Pictures of objects. | Detailed object outlines marked. | 9146 | Images | Classification, object recognition | 2003 | [69][70] | F. Li et al. |
| Caltech-256 | Large dataset of images for object classification. | Images categorized and hand-sorted. | 30,607 | Images, text | Classification, object detection | 2007 | [71][72] | G. Griffin et al. |
| SIFT10M Dataset | SIFT features of Caltech-256 dataset. | Extensive SIFT feature extraction. | 11,164,866 | Text | Classification, object detection | 2016 | [73] | X. Fu et al. |
| LabelMe | Annotated pictures of scenes. | Objects outlined. | 187,240 | Images, text | Classification, object detection | 2005 | [74] | MIT Computer Science and Artificial Intelligence Laboratory |
| Cityscapes Dataset | Stereo video sequences recorded in street scenes, with pixel-level annotations. Metadata also included. | Pixel-level segmentation and labeling. | 25,000 | Images, text | Classification, object detection | 2016 | [75] | Daimler AG et al. |
| PASCAL VOC Dataset | Large number of images for classification tasks. | Labeling, bounding box included. | 500,000 | Images, text | Classification, object detection | 2010 | [76][77] | M. Everingham et al. |
| CIFAR-10 Dataset | Many small, low-resolution images of 10 classes of objects. | Classes labelled, training set splits created. | 60,000 | Images | Classification | 2009 | [65][78] | A. Krizhevsky et al. |
| CIFAR-100 Dataset | Like CIFAR-10, above, but 100 classes of objects are given. | Classes labelled, training set splits created. | 60,000 | Images | Classification | 2009 | [65][78] | A. Krizhevsky et al. |
| German Traffic Sign Detection Benchmark Dataset | Images from vehicles of traffic signs on German roads. These signs comply with UN standards and therefore are the same as in other countries. | Signs manually labeled. | 900 | Images | Classification | 2013 | [79][80] | S. Houben et al. |
| KITTI Vision Benchmark Dataset | Autonomous vehicles driving through a mid-size city captured images of various areas using cameras and laser scanners. | Many benchmarks extracted from data. | >100 GB of data | Images, text | Classification, object detection | 2012 | [81][82] | A. Geiger et al. |

Handwriting and character recognition

| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Artificial Characters Dataset | Artificially generated data describing the structure of 10 capital English letters. | Coordinates of lines drawn given as integers. Various other features. | 6000 | Text | Handwriting recognition, classification | 1992 | [83] | H. Guvenir et al. |
| Letter Dataset | Upper case printed letters. | 17 features are extracted from all images. | 20,000 | Text | OCR, classification | 1991 | [84][85] | D. Slate et al. |
| Character Trajectories Dataset | Labeled samples of pen tip trajectories for people writing simple characters. | 3-dimensional pen tip velocity trajectory matrix for each sample. | 2858 | Text | Handwriting recognition, classification | 2008 | [86][87] | B. Williams |
| Chars74K Dataset | Character recognition in natural images of symbols used in both English and Kannada. | | 74,107 | | Character recognition, handwriting recognition, OCR, classification | 2009 | [88] | T. de Campos |
| UJI Pen Characters Dataset | Isolated handwritten characters. | Coordinates of pen position as characters were written given. | 11,640 | Text | Handwriting recognition, classification | 2009 | [89][90] | F. Prat et al. |
| Gisette Dataset | Handwriting samples from the often-confused 4 and 9 characters. | Features extracted from images, split into train/test, handwriting images size-normalized. | 13,500 | Images, text | Handwriting recognition, classification | 2003 | [91] | Yann LeCun et al. |
| MNIST Database | Database of handwritten digits. | Hand-labeled. | 60,000 | Images, text | Classification | 1998 | [92][93] | National Institute of Standards and Technology |
| Optical Recognition of Handwritten Digits Dataset | Normalized bitmaps of handwritten data. | Size normalized and mapped to bitmaps. | 5620 | Images, text | Handwriting recognition, classification | 1998 | [94] | E. Alpaydin et al. |
| Pen-Based Recognition of Handwritten Digits Dataset | Handwritten digits on electronic pen-tablet. | Feature vectors extracted to be uniformly spaced. | 10,992 | Images, text | Handwriting recognition, classification | 1998 | [95][96] | E. Alpaydin et al. |
| Semeion Handwritten Digit Dataset | Handwritten digits from 80 people. | All handwritten digits have been normalized for size and mapped to the same grid. | 1593 | Images, text | Handwriting recognition, classification | 2008 | [97] | T. Srl |
| HASYv2 | Handwritten mathematical symbols. | All symbols are centered and of size 32px x 32px. | 168,233 | Images, text | Classification | 2017 | [98] | Martin Thoma |

Aerial images

| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Aerial Image Segmentation Dataset | 80 high-resolution aerial images with spatial resolution ranging from 0.3 to 1.0. | Images manually segmented. | 80 | Images | Aerial classification, object detection | 2013 | [99][100] | J. Yuan et al. |
| KIT AIS Data Set | Multiple labeled training and evaluation datasets of aerial images of crowds. | Images manually labeled to show paths of individuals through crowds. | ~ 150 | Images with paths | People tracking, aerial tracking | 2012 | [101][102] | M. Butenuth et al. |
| Wilt Dataset | Remote sensing data of diseased trees and other land cover. | Various features extracted. | 4899 | Images | Classification, aerial object detection | 2014 | [103][104] | B. Johnson |
| Forest Type Mapping Dataset | Satellite imagery of forests in Japan. | Image wavelength bands extracted. | 326 | Text | Classification | 2015 | [105][106] | B. Johnson |
| Overhead Imagery Research Data Set | Annotated overhead imagery. Images with multiple objects. | Over 30 annotations and over 60 statistics that describe the target within the context of the image. | 1000 | Images, text | Classification | 2009 | [107][108] | F. Tanner et al. |

Other images

| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MPII Cooking Activities Dataset | Videos and images of various cooking activities. | Activity paths and directions, labels, fine-grained motion labeling, activity class, still image extraction and labeling. | 881,755 frames | Labeled video, images, text | Classification | 2012 | [109][110] | M. Rohrbach et al. |
| Stanford Dogs Dataset | Images of 120 breeds of dogs from around the world. | Train/test splits and ImageNet annotations provided. | 20,580 | Images, text | Fine-grain classification | 2011 | [111][112] | A. Khosla et al. |
| The Oxford-IIIT Pet Dataset | 37 categories of pets with roughly 200 images of each. | Breed labeled, tight bounding box, foreground-background segmentation. | ~ 7,400 | Images, text | Classification, object detection | 2012 | [112][113] | O. Parkhi et al. |
| Corel Image Features Data Set | Database of images with features extracted. | Many features including color histogram, co-occurrence texture, and color moments. | 68,040 | Text | Classification, object detection | 1999 | [114][115] | M. Ortega-Bindenberger et al. |
| Online Video Characteristics and Transcoding Time Dataset | Transcoding times for various different videos and video properties. | Video features given. | 168,286 | Text | Regression | 2015 | [116] | T. Deneke et al. |
| Microsoft Sequential Image Narrative Dataset (SIND) | Dataset for sequential vision-to-language. | Descriptive caption and storytelling given for each photo, and photos are arranged in sequences. | 81,743 | Images, text | Visual storytelling | 2016 | [117] | Microsoft Research |
| Caltech-UCSD Birds-200-2011 Dataset | Large dataset of images of birds. | Part locations for birds, bounding boxes, 312 binary attributes given. | 11,788 | Images, text | Classification | 2011 | [118][119] | C. Wah et al. |
| YouTube-8M | Large and diverse labeled video dataset. | YouTube video IDs and associated labels from a diverse vocabulary of 4800 visual entities. | 8 million | Video, text | Video classification | 2016 | [120][121] | S. Abu-El-Haija et al. |
| YFCC100M | Large and diverse labeled image and video dataset. | Flickr videos and images and associated description, titles, tags, and other metadata (such as EXIF and geotags). | 100 million | Video, image, text | Video and image classification | 2016 | [122][123] | B. Thomee et al. |
| Discrete LIRIS-ACCEDE | Short videos annotated for valence and arousal. | Valence and arousal labels. | 9800 | Video | Video emotion elicitation detection | 2015 | [124] | Y. Baveye et al. |
| Continuous LIRIS-ACCEDE | Long videos annotated for valence and arousal while also collecting Galvanic Skin Response. | Valence and arousal labels. | 30 | Video | Video emotion elicitation detection | 2015 | [125] | Y. Baveye et al. |
| MediaEval LIRIS-ACCEDE | Extension of Discrete LIRIS-ACCEDE including annotations for violence levels of the films. | Violence, valence and arousal labels. | 10900 | Video | Video emotion elicitation detection | 2015 | [126] | Y. Baveye et al. |

Text data

Datasets consisting primarily of text for tasks such as natural language processing, sentiment analysis, translation, and cluster analysis.


| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Amazon reviews | US product reviews from Amazon.com. | None. | ~ 82M | Text | Classification, sentiment analysis | 2015 | [127] | McAuley et al. |
| OpinRank Review Dataset | Reviews of cars and hotels from Edmunds.com and TripAdvisor respectively. | None. | 42,230 / ~259,000 respectively | Text | Sentiment analysis, clustering | 2011 | [128][129] | K. Ganesan et al. |
| MovieLens | 22,000,000 ratings and 580,000 tags applied to 33,000 movies by 240,000 users. | None. | ~ 22M | Text | Regression, clustering, classification | 2016 | [130] | GroupLens Research |
| Yahoo! Music User Ratings of Musical Artists | Over 10M ratings of artists by Yahoo users. | None described. | ~ 10M | Text | Clustering, regression | 2004 | [131][132] | Yahoo! |
| Car Evaluation Data Set | Car properties and their overall acceptability. | Six categorical features given. | 1728 | Text | Classification | 1997 | [133][134] | M. Bohanec |
| YouTube Comedy Slam Preference Dataset | User vote data for pairs of videos shown on YouTube. Users voted on funnier videos. | Video metadata given. | 1,138,562 | Text | Classification | 2012 | [135][136] | Google |
| Skytrax User Reviews Dataset | User reviews of airlines, airports, seats, and lounges from Skytrax. | Ratings are fine-grain and include many aspects of airport experience. | 41396 | Text | Classification, regression | 2015 | [137] | Q. Nguyen |
| Teaching Assistant Evaluation Dataset | Teaching assistant reviews. | Features of each instance such as class, class size, and instructor are given. | 151 | Text | Classification | 1997 | [138][139] | W. Loh et al. |

News articles

| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NYSK Dataset | English news articles about the case relating to allegations of sexual assault against the former IMF director Dominique Strauss-Kahn. | Filtered and presented in XML format. | 10,421 | XML, text | Sentiment analysis, topic extraction | 2013 | [140] | Dermouche, M. et al. |
| The Reuters Corpus Volume 1 | Large corpus of Reuters news stories in English. | Fine-grain categorization and topic codes. | 810,000 | Text | Classification, clustering, summarization | 2002 | [141] | Reuters |
| The Reuters Corpus Volume 2 | Large corpus of Reuters news stories in multiple languages. | Fine-grain categorization and topic codes. | 487,000 | Text | Classification, clustering, summarization | 2005 | [142] | Reuters |
| Thomson Reuters Text Research Collection | Large corpus of news stories. | Details not described. | 1,800,370 | Text | Classification, clustering, summarization | 2009 | [143] | T. Rose et al. |
| Saudi Newspapers Corpus | 31,030 Arabic newspaper articles. | Metadata extracted. | 31,030 | JSON | Summarization, clustering | 2015 | [144] | M. Alhagri |


| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Enron Email Dataset | Emails from employees at Enron organized into folders. | Attachments removed, invalid email addresses converted to user@enron.com or no_address@enron.com. | ~ 500,000 | Text | Network analysis, sentiment analysis | 2004 (2015) | [145][146] | Klimt, B. and Y. Yang |
| Ling-Spam Dataset | Corpus containing both legitimate and spam emails. | Four versions of the corpus involving whether or not a lemmatiser or stop-list was enabled. | | Text | Classification | 2000 | [147][148] | Androutsopoulos, J. et al. |
| SMS Spam Collection Dataset | Collected SMS spam messages. | None. | 5574 | Text | Classification | 2011 | [149][150] | T. Almeida et al. |
| Twenty Newsgroups Dataset | Messages from 20 different newsgroups. | None. | 20,000 | Text | Natural language processing | 1999 | [151] | T. Mitchell et al. |
| Spambase Dataset | Spam emails. | Many text features extracted. | 4601 | Text | Spam detection, classification | 1999 | [152] | M. Hopkins et al. |

Twitter and tweets

| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Sentiment140 | Tweet data from 2009 including original text, time stamp, user and sentiment. | Classified using distant supervision from presence of emoticon in tweet. | 1,578,627 | Tweets, comma-separated values | Sentiment analysis | 2009 | [153][154] | A. Go et al. |
| ASU Twitter Dataset | Twitter network data, not actual tweets. Shows connections between a large number of users. | None. | 11,316,811 users, 85,331,846 connections | Text | Clustering, graph analysis | 2009 | [155][156] | R. Zafarani et al. |
| SNAP Social Circles: Twitter Database | Large Twitter network data. | Node features, circles, and ego networks. | 1,768,149 | Text | Clustering, graph analysis | 2012 | [157][158] | J. McAuley et al. |
| Twitter Dataset for Arabic Sentiment Analysis | Arabic tweets. | Samples hand-labeled as positive or negative. | 2000 | Text | Classification | 2014 | [159][160] | N. Abdulla |
| Buzz in Social Media Dataset | Data from Twitter and Tom's Hardware. This dataset focuses on specific buzz topics being discussed on those sites. | Data is windowed so that the user can attempt to predict the events leading up to social media buzz. | 140,000 | Text | Regression, classification | 2013 | [161][162] | F. Kawala et al. |
| Paraphrase and Semantic Similarity in Twitter (PIT) | This dataset focuses on whether tweets have (almost) the same meaning/information or not. Manually labeled. | Tokenization, part-of-speech and named entity tagging. | 18,762 | Text | Regression, classification | 2015 | [163][164] | Xu et al. |
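
The Sentiment140 entry above notes that its labels come from distant supervision based on emoticons. The idea can be sketched roughly as follows: a tweet with only positive emoticons is labeled positive, one with only negative emoticons is labeled negative, and the emoticons are then stripped so a classifier cannot simply memorize them. The emoticon lists and the function name `distant_label` are assumptions made for illustration, not the rules used by the dataset's authors.

```python
# Hypothetical emoticon lists, chosen for illustration only.
POSITIVE = (":)", ":-)", ":D", "=)")
NEGATIVE = (":(", ":-(")


def distant_label(tweet):
    """Return (label, cleaned_text), or None when no usable signal exists.

    A tweet is skipped when it has no emoticon at all, or when it mixes
    positive and negative emoticons and is therefore ambiguous.
    """
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:
        return None
    label = "positive" if has_pos else "negative"
    # Remove the emoticons so the label is not leaked into the text.
    cleaned = tweet
    for emoticon in POSITIVE + NEGATIVE:
        cleaned = cleaned.replace(emoticon, "")
    # Collapse any whitespace left behind by the removal.
    cleaned = " ".join(cleaned.split())
    return label, cleaned


if __name__ == "__main__":
    for text in ["great day :)", "missed the bus :(", "no emoticon here"]:
        print(text, "->", distant_label(text))
```

Labels produced this way are noisy by construction, which is the trade-off distant supervision makes in exchange for not needing hand annotation.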

Other text

| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Legal Case Reports | Federal Court of Australia cases from 2006–2009. | None. | 4,000 | Text | Summarization, citation analysis | 2012 | [165][166] | F. Galgani et al. |
| Blogger Authorship Corpus | Blog entries of 19,320 people from blogger.com. | Blogger self-provided gender, age, industry, and astrological sign. | 681,288 | Text | Sentiment analysis, summarization, classification | 2006 | [167][168] | J. Schler et al. |
| Social Structure of Facebook Networks | Large dataset of the social structure of Facebook. | None. | 100 colleges covered | Text | Network analysis, clustering | 2012 | [169][170] | A. Traud et al. |
| Dataset for the Machine Comprehension of Text | Stories and associated questions for testing comprehension of text. | None. | 660 | Text | Natural language processing, machine comprehension | 2013 | [171][172] | M. Richardson et al. |
| The Penn Treebank Project | Naturally occurring text annotated for linguistic structure. | Text is parsed into semantic trees. | ~ 1M words | Text | Natural language processing, summarization | 1995 | [173][174] | M. Marcus et al. |
| DEXTER Dataset | Task given is to determine, from features given, which articles are about corporate acquisitions. | Features extracted include word stems. Distractor features included. | 2600 | Text | Classification | 2008 | [175] | Reuters |
| Google Books N-grams | N-grams from a very large corpus of books. | None. | 2.2 TB of text | Text | Classification, clustering, regression | 2011 | [176][177] | Google |
| Personae Corpus | Collected for experiments in authorship attribution and personality prediction. Consists of 145 Dutch-language essays. | In addition to normal texts, syntactically annotated texts are given. | 145 | Text | Classification, regression | 2008 | [178][179] | K. Luyckx et al. |
| CNAE-9 Dataset | Categorization task for free text descriptions of Brazilian companies. | Word frequency has been extracted. | 1080 | Text | Classification | 2012 | [180][181] | P. Ciarelli et al. |
| Sentiment Labeled Sentences Dataset | 3000 sentiment labeled sentences. | Sentiment of each sentence has been hand labeled as positive or negative. | 3000 | Text | Classification, sentiment analysis | 2015 | [182][183] | D. Kotzias |
| BlogFeedback Dataset | Dataset to predict the number of comments a post will receive based on features of that post. | Many features of each post extracted. | 60,021 | Text | Regression | 2014 | [184][185] | K. Buza |
| Stanford Natural Language Inference (SNLI) Corpus | Image captions matched with newly constructed sentences to form entailment, contradiction, or neutral pairs. | Entailment class labels, syntactic parsing by the Stanford PCFG parser. | 570,000 | Text | Natural language inference/recognizing textual entailment | 2015 | [186] | S. Bowman et al. |

Sound data

Datasets of sounds and sound features.


| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Zero Resource Speech Challenge 2015 | Spontaneous speech (English), read speech (Xitsonga). | Raw WAV. | English: 5h, 12 speakers; Xitsonga: 2h30, 24 speakers | Sound | Unsupervised discovery of speech features/subword units/word units | 2015 | [187][188] www.zerospeech.com/2015 | Versteegh et al. |
| Parkinson Speech Dataset | Multiple recordings of people with and without Parkinson's Disease. | Voice features extracted, disease scored by physician using unified Parkinson's disease rating scale. | 1,040 | Text | Classification, regression | 2013 | [189][190] | B. E. Sakar et al. |
| Spoken Arabic Digits | Spoken Arabic digits from 44 male and 44 female speakers. | Time-series of mel-frequency cepstrum coefficients. | 8,800 | Text | Classification | 2010 | [191][192] | M. Bedda et al. |
| ISOLET Dataset | Spoken letter names. | Features extracted from sounds. | 7797 | Text | Classification | 1994 | [193][194] | R. Cole et al. |
| Japanese Vowels Dataset | Nine male speakers uttered two Japanese vowels successively. | 12-degree linear prediction analysis applied to obtain a discrete-time series with 12 cepstrum coefficients. | 640 | Text | Classification | 1999 | [195][196] | M. Kudo et al. |
| Parkinson's Telemonitoring Dataset | Multiple recordings of people with and without Parkinson's Disease. | Sound features extracted. | 5875 | Text | Classification | 2009 | [197][198] | A. Tsanas et al. |
| TIMIT | Recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. | Speech is lexically and phonemically transcribed. | 6300 | Text | Speech recognition, classification | 1986 | [199][200] | J. Garofolo et al. |
| Arabic Speech Corpus | A single-speaker, Modern Standard Arabic (MSA) speech corpus with phonetic and orthographic transcripts aligned to phoneme level. | Speech is orthographically and phonetically transcribed with stress marks. | ~1900 | Text, WAV | Speech synthesis, speech recognition, corpus alignment, speech therapy, education | 2016 | [201] | N. Halabi |


| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Geographical Original of Music Data Set | Audio features of music samples from different locations. | Audio features extracted using MARSYAS software. | 1,059 | Text | Geographical classification, clustering | 2014 | [202][203] | F. Zhou et al. |
| Million Song Dataset | Audio features from one million different songs. | Audio features extracted. | 1M | Text | Classification, clustering | 2011 | [204][205] | T. Bertin-Mahieux et al. |
| Free Music Archive | Audio under Creative Commons from 100k songs (343 days, 1TiB) with a hierarchy of 161 genres, metadata, user data, free-form text. | Raw audio and audio features. | 106,574 | Text, MP3 | Classification, recommendation | 2017 | [206] | M. Defferrard et al. |
| Bach Choral Harmony Dataset | Bach chorale chords. | Audio features extracted. | 5665 | Text | Classification | 2014 | [207][208] | D. Radicioni et al. |

Other sounds

| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| UrbanSound | Labeled sound recordings of sounds like air conditioners, car horns and children playing. | Sorted into folders by class of events as well as metadata in a JSON file and annotations in a CSV file. | 1,059 | Sound | Classification | 2014 | [209][210] | J. Salamon et al. |

Signal data

Datasets containing electrical signal information that requires some form of signal processing for further analysis.


| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Witty Worm Dataset | Dataset detailing the spread of the Witty worm and the infected computers. | Split into a publicly available set and a restricted set containing more sensitive information like IP and UDP headers. | 55,909 IP addresses | Text | Classification | 2004 | [211][212] | Center for Applied Internet Data Analysis |
| Cuff-Less Blood Pressure Estimation Dataset | Cleaned vital signals from human patients which can be used to estimate blood pressure. | 125 Hz vital signs have been cleaned. | 12,000 | Text | Classification, regression | 2015 | [213][214] | M. Kachuee et al. |
| Gas Sensor Array Drift Dataset | Measurements from 16 chemical sensors utilized in simulations for drift compensation. | Extensive number of features given. | 13,910 | Text | Classification | 2012 | [215][216] | A. Vergara |
| Servo Dataset | Data covering the nonlinear relationships observed in a servo-amplifier circuit. | Levels of various components as a function of other components are given. | 167 | Text | Regression | 1993 | [217][218] | K. Ullrich |
| UJIIndoorLoc-Mag Dataset | Indoor localization database to test indoor positioning systems. Data is magnetic field based. | Train and test splits given. | 40,000 | Text | Classification, regression, clustering | 2015 | [219][220] | D. Rambla et al. |
| Sensorless Drive Diagnosis Dataset | Electrical signals from motors with defective components. | Statistical features extracted. | 58,508 | Text | Classification | 2015 | [221][222] | M. Bator |


| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Wearable Computing: Classification of Body Postures and Movements (PUC-Rio) | People performing five standard actions while wearing motion trackers. | None. | 165,632 | Text | Classification | 2013 | [223][224] | Pontifical Catholic University of Rio de Janeiro |
| Gesture Phase Segmentation Dataset | Features extracted from video of people doing various gestures. | Features extracted aim at studying gesture phase segmentation. | 9900 | Text | Classification, clustering | 2014 | [225][226] | R. Madeo et al. |
| Vicon Physical Action Data Set | 10 normal and 10 aggressive physical actions that measure the human activity tracked by a 3D tracker. | Many parameters recorded by 3D tracker. | 3000 | Text | Classification | 2011 | [227][228] | T. Theodoridis |
| Daily and Sports Activities Dataset | Motor sensor data for 19 daily and sports activities. | Many sensors given, no preprocessing done on signals. | 9120 | Text | Classification | 2013 | [229][230] | B. Barshan et al. |
| Human Activity Recognition Using Smartphones Dataset | Gyroscope and accelerometer data from people wearing smartphones and performing normal actions. | Actions performed are labeled, all signals preprocessed for noise. | 10,299 | Text | Classification | 2012 | [231][232] | J. Reyes-Ortiz et al. |
| Australian Sign Language Signs | Australian sign language signs captured by motion-tracking gloves. | None. | 2565 | Text | Classification | 2002 | [233][234] | M. Kadous |
| Weight Lifting Exercises monitored with Inertial Measurement Units | Five variations of the biceps curl exercise monitored with IMUs. | Some statistics calculated from raw data. | 39,242 | Text | Classification | 2013 | [235][236] | W. Ugulino et al. |
| sEMG for Basic Hand movements Dataset | Two databases of surface electromyographic signals of 6 hand movements. | None. | 3000 | Text | Classification | 2014 | [237][238] | C. Sapsanis et al. |
| REALDISP Activity Recognition Dataset | Evaluate techniques dealing with the effects of sensor displacement in wearable activity recognition. | None. | 1419 | Text | Classification | 2014 | [238][239] | O. Banos et al. |
| Heterogeneity Activity Recognition Dataset | Data from multiple different smart devices for humans performing various activities. | None. | 43,930,257 | Text | Classification, clustering | 2015 | [240][241] | A. Stisen et al. |
| Indoor User Movement Prediction from RSS Data | Temporal wireless network data that can be used to track the movement of people in an office. | None. | 13,197 | Text | Classification | 2016 | [242][243] | D. Bacciu |
| PAMAP2 Physical Activity Monitoring Dataset | 18 different types of physical activities performed by 9 subjects wearing 3 IMUs. | None. | 3,850,505 | Text | Classification | 2012 | [244] | A. Reiss |
| OPPORTUNITY Activity Recognition Dataset | Human activity recognition from wearable, object, and ambient sensors; a dataset devised to benchmark human activity recognition algorithms. | None. | 2551 | Text | Classification | 2012 | [245][246] | D. Roggen et al. |
| Real World Activity Recognition Dataset | Human activity recognition from wearable devices. Distinguishes between seven on-body device positions and comprises six different kinds of sensors. | None. | 3,150,000 (per sensor) | Text | Classification | 2016 | [247] | T. Sztyler et al. |

Other signals

| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
|---|---|---|---|---|---|---|---|---|
| Wine Dataset | Chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. | 13 properties of each wine are given. | 178 | Text | Classification, regression | 1991 | [248][249] | M. Forina et al. |
| Combined Cycle Power Plant Data Set | Data from various sensors within a power plant running for 6 years. | None. | 9568 | Text | Regression | 2014 | [250][251] | P. Tufekci et al. |

Physical data

Datasets from physical systems.

High-energy physics

| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
|---|---|---|---|---|---|---|---|---|
| HIGGS Dataset | Monte Carlo simulations of particle accelerator collisions. | 28 features of each collision are given. | 11M | Text | Classification | 2014 | [252][253][254] | D. Whiteson |
| HEPMASS Dataset | Monte Carlo simulations of particle accelerator collisions. Goal is to separate the signal from noise. | 28 features of each collision are given. | 10,500,000 | Text | Classification | 2016 | [253][254][255] | D. Whiteson |


| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
|---|---|---|---|---|---|---|---|---|
| Yacht Hydrodynamics Dataset | Yacht performance based on dimensions. | Six features are given for each yacht. | 308 | Text | Regression | 2013 | [256][257] | R. Lopez |
| Robot Execution Failures Dataset | 5 data sets that center around robotic failure to execute common tasks. | Integer valued features such as torque and other sensor measurements. | 463 | Text | Classification | 1999 | [258] | L. Seabra et al. |
| Pittsburgh Bridges Dataset | Design description is given in terms of several properties of various bridges. | Various bridge features are given. | 108 | Text | Classification | 1990 | [259][260] | Y. Reich et al. |
| Automobile Dataset | Data about automobiles, their insurance risk, and their normalized losses. | Car features extracted. | 205 | Text | Regression | 1987 | [261][262] | J. Schlimmer et al. |
| Auto MPG Dataset | MPG data for cars. | Eight features of each car given. | 398 | Text | Regression | 1993 | [263] | Carnegie Mellon University |
| Energy Efficiency Dataset | Heating and cooling requirements given as a function of building parameters. | Building parameters given. | 768 | Text | Classification, regression | 2012 | [264][265] | A. Xifara et al. |
| Airfoil Self-Noise Dataset | A series of aerodynamic and acoustic tests of two and three-dimensional airfoil blade sections. | Data about frequency, angle of attack, etc., are given. | 1503 | Text | Regression | 2014 | [266] | R. Lopez |
| Challenger USA Space Shuttle O-Ring Dataset | Attempt to predict O-ring problems given past Challenger data. | Several features of each flight, such as launch temperature, are given. | 23 | Text | Regression | 1993 | [267][268] | D. Draper et al. |
| Statlog (Shuttle) Dataset | NASA space shuttle datasets. | Nine features given. | 58,000 | Text | Classification | 2002 | [269] | NASA |


| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
|---|---|---|---|---|---|---|---|---|
| Volcanoes on Venus – JARtool experiment Dataset | Venus images returned by the Magellan spacecraft. | Images are labeled by humans. | not given | Images | Classification | 1991 | [270][271] | M. Burl |
| MAGIC Gamma Telescope Dataset | Monte Carlo generated high-energy gamma particle events. | Numerous features extracted from the simulations. | 19,020 | Text | Classification | 2007 | [271][272] | R. Bock |
| Solar Flare Dataset | Measurements of the number of certain types of solar flare events occurring in a 24-hour period. | Many solar flare-specific features are given. | 1389 | Text | Regression, classification | 1989 | [273] | G. Bradshaw |

Earth science

| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
|---|---|---|---|---|---|---|---|---|
| Volcanoes of the World | Volcanic eruption data for all known volcanic events on earth. | Details such as region, subregion, tectonic setting, dominant rock type are given. | 1535 | Text | Regression, classification | 2013 | [274] | E. Venzke et al. |
| Seismic-bumps Dataset | Seismic activities from a coal mine. | Seismic activity was classified as hazardous or not. | 2584 | Text | Classification | 2013 | [275][276] | M. Sikora et al. |

Other physical

| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
|---|---|---|---|---|---|---|---|---|
| Concrete Compressive Strength Dataset | Dataset of concrete properties and compressive strength. | Nine features are given for each sample. | 1030 | Text | Regression | 2007 | [277][278] | I. Yeh |
| Concrete Slump Test Dataset | Concrete slump flow given in terms of properties. | Features of concrete given such as fly ash, water, etc. | 103 | Text | Regression | 2009 | [279][280] | I. Yeh |
| Musk Dataset | Predict if a molecule, given the features, will be a musk or a non-musk. | 168 features given for each molecule. | 6598 | Text | Classification | 1994 | [281] | Arris Pharmaceutical Corp. |
| Steel Plates Faults Dataset | Steel plates of 7 different types. | 27 features given for each sample. | 1941 | Text | Classification | 2010 | [282] | Semeion Research Center |

Biological data

Datasets from biological systems.


| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
|---|---|---|---|---|---|---|---|---|
| EEG Database | Study to examine EEG correlates of genetic predisposition to alcoholism. | Measurements from 64 electrodes placed on the scalp sampled at 256 Hz (3.9 ms epoch) for 1 second. | 122 | Text | Classification | 1999 | [283][284] | H. Begleiter |
| P300 Interface Dataset | Data from nine subjects collected using P300-based brain-computer interface for disabled subjects. | Split into four sessions for each subject. MATLAB code given. | 1,224 | Text | Classification | 2008 | [285][286] | U. Hoffman et al. |
| Heart Disease Data Set | Attributes of patients with and without heart disease. | 75 attributes given for each patient with some missing values. | 303 | Text | Classification | 1988 | [287][288] | A. Janosi et al. |
| Breast Cancer Wisconsin (Diagnostic) Dataset | Dataset of features of breast masses. Diagnosis by physician is given. | 10 features for each sample are given. | 569 | Text | Classification | 1995 | [289][290] | W. Wolberg et al. |
| National Survey on Drug Use and Health | Large scale survey on health and drug use in the United States. | None. | 55,268 | Text | Classification, regression | 2012 | [291] | United States Department of Health and Human Services |
| Lung Cancer Dataset | Lung cancer dataset without attribute definitions. | 56 features are given for each case. | 32 | Text | Classification | 1992 | [292][293] | Z. Hong et al. |
| Arrhythmia Dataset | Data for a group of patients, of which some have cardiac arrhythmia. | 276 features for each instance. | 452 | Text | Classification | 1998 | [294][295] | H. Altay et al. |
| Diabetes 130-US hospitals for years 1999–2008 Dataset | 9 years of readmission data across 130 US hospitals for patients with diabetes. | Many features of each readmission are given. | 100,000 | Text | Classification, clustering | 2014 | [296][297] | J. Clore et al. |
| Diabetic Retinopathy Debrecen Dataset | Features extracted from images of eyes with and without diabetic retinopathy. | Features extracted and conditions diagnosed. | 1151 | Text | Classification | 2014 | [298][299] | B. Antal et al. |
| Liver Disorders Dataset | Data for people with liver disorders. | Seven biological features given for each patient. | 345 | Text | Classification | 1990 | [300][301] | Bupa Medical Research Ltd. |
| Thyroid Disease Dataset | 10 databases of thyroid disease patient data. | None. | 7200 | Text | Classification | 1987 | [302][303] | R. Quinlan |
| Mesothelioma Dataset | Mesothelioma patient data. | Large number of features, including asbestos exposure, are given. | 324 | Text | Classification | 2016 | [304][305] | A. Tanrikulu et al. |
| KEGG Metabolic Reaction Network (Undirected) Dataset | Network of metabolic pathways. A reaction network and a relation network are given. | Detailed features for each network node and pathway are given. | 65,554 | Text | Classification, clustering, regression | 2011 | [306] | M. Naeem et al. |
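
The EEG Database entry above quotes a 256 Hz sampling rate, a 3.9 ms epoch, and a 1 second recording from 64 electrodes. As a rough consistency check, the short sketch below derives the sample period and the number of values in one recording from those three figures. The variable names are illustrative only and are not taken from the dataset's documentation.

```python
# Figures quoted in the EEG Database row of the table above.
SAMPLING_RATE_HZ = 256
DURATION_S = 1
ELECTRODES = 64

# Time between consecutive samples, in milliseconds.
sample_period_ms = 1000 / SAMPLING_RATE_HZ

# Samples recorded per electrode, and values across all electrodes.
samples_per_electrode = SAMPLING_RATE_HZ * DURATION_S
values_per_recording = samples_per_electrode * ELECTRODES

print(f"sample period: {sample_period_ms} ms")
print(f"samples per electrode: {samples_per_electrode}")
print(f"values per recording: {values_per_recording}")
```

The computed period of 3.90625 ms rounds to the 3.9 ms stated in the table.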


| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
|---|---|---|---|---|---|---|---|---|
| Abalone Dataset | Physical measurements of Abalone. Weather patterns and location are also given. | None. | 4177 | Text | Regression | 1995 | [307] | Marine Research Laboratories – Taroona |
| Zoo Dataset | Artificial dataset covering 7 classes of animals. | Animals are classed into 7 categories and features are given for each. | 101 | Text | Classification | 1990 | [308] | R. Forsyth |
| Demospongiae Dataset | Data about marine sponges. | 503 sponges in the Demosponge class are described by various features. | 503 | Text | Classification | 2010 | [309] | E. Armengol et al. |
| Splice-junction Gene Sequences Dataset | Primate splice-junction gene sequences (DNA) with associated imperfect domain theory. | None. | 3190 | Text | Classification | 1992 | [293] | G. Towell et al. |
| Mice Protein Expression Dataset | Expression levels of 77 proteins measured in the cerebral cortex of mice. | None. | 1080 | Text | Classification, clustering | 2015 | [310][311] | C. Higuera et al. |


| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
|---|---|---|---|---|---|---|---|---|
| Forest Fires Dataset | Forest fires and their properties. | 13 features of each fire are extracted. | 517 | Text | Regression | 2008 | [312][313] | P. Cortez et al. |
| Iris Dataset | Three types of iris plants are described by 4 different attributes. | None. | 150 | Text | Classification | 1936 | [314][315] | R. Fisher |
| Plant Species Leaves Dataset | Sixteen samples of leaf each of one-hundred plant species. | Shape descriptor, fine-scale margin, and texture histograms are given. | 1600 | Text | Classification | 2012 | [316][317] | J. Cope et al. |
| Mushroom Dataset | Mushroom attributes and classification. | Many properties of each mushroom are given. | 8124 | Text | Classification | 1987 | [318] | J. Schlimmer |
| Soybean Dataset | Database of diseased soybean plants. | 35 features for each plant are given. Plants are classified into 19 categories. | 307 | Text | Classification | 1988 | [319] | R. Michalski et al. |
| Seeds Dataset | Measurements of geometrical properties of kernels belonging to three different varieties of wheat. | None. | 210 | Text | Classification, clustering | 2012 | [320][321] | Charytanowicz et al. |
| Covertype Dataset | Data for predicting forest cover type strictly from cartographic variables. | Many geographical features given. | 581,012 | Text | Classification | 1998 | [322][323] | J. Blackard et al. |
| Abscisic Acid Signaling Network Dataset | Data for a plant signaling network. Goal is to determine set of rules that governs the network. | None. | 300 | Text | Causal-discovery | 2008 | [324] | J. Jenkens et al. |
| Folio Dataset | 20 photos of leaves for each of 32 species. | None. | 637 | Images, text | Classification, clustering | 2015 | [325][326] | T. Munisami et al. |
| Oxford Flower Dataset | 17 category dataset of flowers. | Train/test splits, labeled images. | 1360 | Images, text | Classification | 2006 | [113][327] | M-E Nilsback et al. |


| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
|---|---|---|---|---|---|---|---|---|
| Ecoli Dataset | Protein localization sites. | Various features of the protein localization sites are given. | 336 | Text | Classification | 1996 | [328][329] | K. Nakai et al. |
| MicroMass Dataset | Identification of microorganisms from mass-spectrometry data. | Various mass spectrometer features. | 931 | Text | Classification | 2013 | [330][331] | P. Mahe et al. |
| Yeast Dataset | Prediction of cellular localization sites of proteins. | Eight features given per instance. | 1484 | Text | Classification | 1996 | [332][333] | K. Nakai et al. |

Drug Discovery

| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
|---|---|---|---|---|---|---|---|---|
| Tox21 Dataset | Prediction of outcome of biological assays. | Chemical descriptors of molecules are given. | 12707 | Text | Classification | 2016 | [334] | A. Mayr et al. |

Anomaly data

| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
|---|---|---|---|---|---|---|---|---|
| Numenta Anomaly Benchmark (NAB) | Data are ordered, timestamped, single-valued metrics. All data files contain anomalies, unless otherwise noted. | ? | 50+ files | Comma separated values | Anomaly detection | 2016 (continually updated) | [335] | Numenta |

Multivariate data

Datasets consisting of rows of observations and columns of attributes characterizing those observations. Typically used for regression analysis or classification but other types of algorithms can also be used. This section includes datasets that do not fit in the above categories.
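
As a minimal sketch of the "rows of observations, columns of attributes" layout described above, the snippet below parses a small comma-separated table and separates the attribute columns from the column to be predicted. The sample values, the column names, and the `load_table` helper are all hypothetical and are not drawn from any dataset listed here.

```python
import csv
import io

# Hypothetical sample: three made-up observations with two numeric
# attributes and one label column.
SAMPLE_CSV = """length,width,label
5.1,3.5,a
4.9,3.0,a
6.3,3.3,b
"""


def load_table(text, target_column):
    """Split CSV text into numeric feature rows and a list of targets."""
    reader = csv.DictReader(io.StringIO(text))
    features, targets = [], []
    for row in reader:
        # Remove the target from the row; what remains are the attributes.
        targets.append(row.pop(target_column))
        features.append([float(value) for value in row.values()])
    return features, targets


features, targets = load_table(SAMPLE_CSV, "label")
print(len(features), "observations,", len(features[0]), "attributes")
```

With a categorical target such as `label` this is a classification setup. Choosing a numeric column as the target instead would give a regression setup, although the helper would then need to convert the target values to numbers.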


| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
|---|---|---|---|---|---|---|---|---|
| Dow Jones Index | Weekly data of stocks from the first and second quarters of 2011. | Calculated values included, such as percentage change and lags. | 750 | Comma separated values | Classification, regression, time series | 2014 | [336][337] | M. Brown et al. |
| Statlog (Australian Credit Approval) | Credit card applications either accepted or rejected and attributes about the application. | Attribute names are removed as well as identifying information. Factors have been relabeled. | 690 | Comma separated values | Classification | 1987 | [338][339] | R. Quinlan |
| eBay auction data | Auction data from various eBay.com objects over auctions of various lengths. | Contains all bids, bidderID, bid times, and opening prices. | ~550 | Text | Regression, classification | 2012 | [340][341] | G. Shmueli et al. |
| Statlog (German Credit Data) | Binary credit classification into "good" or "bad" with many features. | Various financial features of each person are given. | 1000 | Text | Classification | 1994 | [342] | H. Hofmann |
| Bank Marketing Dataset | Data from a large marketing campaign carried out by a large bank. | Many attributes of the clients contacted are given. Whether the client subscribed to the bank is also given. | 45,211 | Text | Classification | 2012 | [343][344] | S. Moro et al. |
| Istanbul Stock Exchange Dataset | Several stock indexes tracked for almost two years. | None. | 536 | Text | Classification, regression | 2013 | [345][346] | O. Akbilgic |
| Default of Credit Card Clients | Credit default data for Taiwanese creditors. | Various features about each account are given. | 30,000 | Text | Classification | 2016 | [347][348] | I. Yeh |


| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
|---|---|---|---|---|---|---|---|---|
| Cloud DataSet | Data about 1024 different clouds. | Image features extracted. | 1024 | Text | Classification, clustering | 1989 | [349] | P. Collard |
| El Nino Dataset | Oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific. | 12 weather attributes are measured at each buoy. | 178080 | Text | Regression | 1999 | [350] | Pacific Marine Environmental Laboratory |
| Greenhouse Gas Observing Network Dataset | Time-series of greenhouse gas concentrations at 2921 grid cells in California created using simulations of the weather. | None. | 2921 | Text | Regression | 2015 | [351] | D. Lucas |
| Atmospheric CO2 from Continuous Air Samples at Mauna Loa Observatory | Continuous air samples in Hawaii, USA. 44 years of records. | None. | 44 years | Text | Regression | 2001 | [352] | Mauna Loa Observatory |
| Ionosphere Dataset | Radar data from the ionosphere. Task is to classify into good and bad radar returns. | Many radar features given. | 351 | Text | Classification | 1989 | [303][353] | Johns Hopkins University |
| Ozone Level Detection Dataset | Two ground ozone level datasets. | Many features given, including weather conditions at time of measurement. | 2536 | Text | Classification | 2008 | [354][355] | K. Zhang et al. |


| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
|---|---|---|---|---|---|---|---|---|
| Adult Dataset | Census data from 1994 containing demographic features of adults and their income. | Cleaned and anonymized. | 48,842 | Comma separated values | Classification | 1996 | [356] | United States Census Bureau |
| Census-Income (KDD) | Weighted census data from the 1994 and 1995 Current Population Surveys. | Split into training and test sets. | 299,285 | Comma separated values | Classification | 2000 | [357][358] | United States Census Bureau |
| IPUMS Census Database | Census data from the Los Angeles and Long Beach areas. | None. | 256,932 | Text | Classification, regression | 1999 | [359] | IPUMS |
| US Census Data 1990 | Partial data from 1990 US census. | Results randomized and useful attributes selected. | 2,458,285 | Text | Classification, regression | 1990 | [360] | United States Census Bureau |


| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
|---|---|---|---|---|---|---|---|---|
| Bike Sharing Dataset | Hourly and daily count of rental bikes in a large city. | Many features, including weather, length of trip, etc., are given. | 17,389 | Text | Regression | 2013 | [361][362] | H. Fanaee-T |
| New York City Taxi Trip Data | Trip data for yellow and green taxis in New York City. | Gives pick up and drop off locations, fares, and other details of trips. | 6 years | Text | Classification, clustering | 2015 | [363] | New York City Taxi and Limousine Commission |
| Taxi Service Trajectory ECML PKDD | Trajectories of all taxis in a large city. | Many features given, including start and stop points. | 1,710,671 | Text | Clustering, causal-discovery | 2015 | [364][365] | M. Ferreira et al. |


| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
|---|---|---|---|---|---|---|---|---|
| Webpages from Common Crawl 2012 | Large collection of webpages and how they are connected via hyperlinks. | None. | 3.5B | Text | Clustering, classification | 2013 | [366] | V. Granville |
| Internet Advertisements Dataset | Dataset for predicting if a given image is an advertisement or not. | Features encode geometry of ads and phrases occurring in the URL. | 3279 | Text | Classification | 1998 | [367][368] | N. Kushmerick |
| Internet Usage Dataset | General demographics of internet users. | None. | 10,104 | Text | Classification, clustering | 1999 | [369] | D. Cook |
| URL Dataset | 120 days of URL data from a large conference. | Many features of each URL are given. | 2,396,130 | Text | Classification | 2009 | [370][371] | J. Ma |
| Phishing Websites Dataset | Dataset of phishing websites. | Many features of each site are given. | 2456 | Text | Classification | 2015 | [372] | R. Mustafa et al. |
| Online Retail Dataset | Online transactions for a UK online retailer. | Details of each transaction given. | 541,909 | Text | Classification, clustering | 2015 | [373] | D. Chen |
| Freebase Simple Topic Dump | Freebase is an online effort to structure all human knowledge. | Topics from Freebase have been extracted. | large | Text | Classification, clustering | 2011 | [374][375] | Freebase |
| Farm Ads Dataset | The text of farm ads from websites. Binary approval or disapproval by content owners is given. | SVMlight sparse vectors of text words in ads calculated. | 4143 | Text | Classification | 2011 | [376][377] | C. Masterharm et al. |


| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
|---|---|---|---|---|---|---|---|---|
| Poker Hand Dataset | 5 card hands from a standard 52 card deck. | Attributes of each hand are given, including the Poker hands formed by the cards it contains. | 1,025,010 | Text | Regression, classification | 2007 | [378] | R. Cattral |
| Connect-4 Dataset | Contains all legal 8-ply positions in the game of Connect-4 in which neither player has won yet, and in which the next move is not forced. | None. | 67,557 | Text | Classification | 1995 | [379] | J. Tromp |
| Chess (King-Rook vs. King) Dataset | Endgame Database for White King and Rook against Black King. | None. | 28,056 | Text | Classification | 1994 | [380][381] | M. Bain et al. |
| Chess (King-Rook vs. King-Pawn) Dataset | King+Rook versus King+Pawn on a7. | None. | 3196 | Text | Classification | 1989 | [382] | R. Holte |
| Tic-Tac-Toe Endgame Dataset | Binary classification for win conditions in tic-tac-toe. | None. | 958 | Text | Classification | 1991 | [383] | D. Aha |

Other multivariate

| Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
|---|---|---|---|---|---|---|---|---|
| Housing Data Set | Median home values of Boston with associated home and neighborhood attributes. | None. | 506 | Text | Regression | 1993 | [384] | D. Harrison et al. |
| The Getty Vocabularies | Structured terminology for art and other material culture, archival materials, visual surrogates, and bibliographic materials. | None. | large | Text | Classification | 2015 | [385] | Getty Center |
| Yahoo! Front Page Today Module User Click Log | User click log for news articles displayed in the Featured Tab of the Today Module on Yahoo! Front Page. | Conjoint analysis with a bilinear model. | 45,811,883 user visits | Text | Regression, clustering | 2009 | [386][387] | Chu et al. |
| British Oceanographic Data Centre | Biological, chemical, physical and geophysical data for oceans. 22K variables tracked. | Various. | 22K variables, many instances | Text | Regression, clustering | 2015 | [388] | British Oceanographic Data Centre |
| Congressional Voting Records Dataset | Voting data for all USA representatives on 16 issues. | Beyond the raw voting data, various other features are provided. | 435 | Text | Classification | 1987 | [389] | J. Schlimmer |
| Entree Chicago Recommendation Dataset | Record of user interactions with Entree Chicago recommendation system. | Each user's usage of the app is recorded in detail. | 50,672 | Text | Regression, recommendation | 2000 | [390] | R. Burke |
| Insurance Company Benchmark (COIL 2000) | Information on customers of an insurance company. | Many features of each customer and the services they use. | 9,000 | Text | Regression, classification | 2000 | [391][392] | P. van der Putten |
| Nursery Dataset | Data from applicants to nursery schools. | Data about applicant's family and various other factors included. | 12,960 | Text | Classification | 1997 | [393][394] | V. Rajkovic et al. |
| University Dataset | Data describing attributes of a large number of universities. | None. | 285 | Text | Clustering, classification | 1988 | [395] | S. Sounders et al. |
| Blood Transfusion Service Center Dataset | Data from a blood transfusion service center. Gives data on donors' return rate, frequency, etc. | None. | 748 | Text | Classification | 2008 | [396][397] | I. Yeh |
| Record Linkage Comparison Patterns Dataset | Large dataset of records. Task is to link relevant records together. | Blocking procedure applied to select only certain record pairs. | 5,749,132 | Text | Classification | 2011 | [398][399] | University of Mainz |
| Nomao Dataset | Nomao collects data about places from many different sources. Task is to detect items that describe the same place. | Duplicates labeled. | 34,465 | Text | Classification | 2012 | [400][401] | Nomao Labs |
| Movie Dataset | Data for 10,000 movies. | Several features for each movie are given. | 10,000 | Text | Clustering, classification | 1999 | [402] | G. Wiederhold |
| Open University Learning Analytics Dataset | Information about students and their interactions with a virtual learning environment. | None. | ~30,000 | Text | Classification, clustering, regression | 2015 | [403][404] | J. Kuzilek et al. |

