dataset
Public Relational Datasets
MySQL
(artificial data, mined)
MySQL 2
(artificial data, mined)
SQL Server
(artificial data, mined)
SQL Server 2
(artificial data)
PostgreSQL
(artificial data, mined)
w3schools
(artificial data)
Vertica
(artificial data)
Recommendation databases
Percona datasets
Graph databases
Pivotal HD
(artificial data)
Hackathon
http://www.tpc.org/tpch/
(http://vitessedata.com/benchmark/)
Thrombosis Database
(KDD Cup 2001)
Genes
(KDD Cup 2001)
Financial database
(PKDD 1999): 99.9% (RollUp, repeated aggregation - prored)
Medical database
(PKDD 1999): 100% (CLAMF)
Hepatitis database
(PKDD 2002)
Hepatitis database
(PKDD 2005, 2MB)
Mutagenesis database
(Srinivasan et al., 1996)
University database
Musk
(Biochemical): 89.20% (axis-parallel rectangles algorithms, Multiple instance problem)
Mondial database
(Geography)
Lattes curricula dataset
(Brazilian IS)
WebKB Project
(Web pages)
Swiss Insurance company
(ECML 1998, Sisyphus Workshop)
MovieLens
imdb_MovieLens, the most complex dataset
Movie database from Stanford
DBLP
(Digital Bibliography & Library Project)
Kinship
(Quinlan, 1990)
Student loan
(tiny)
E. Coli
Tuberculosis
CORA
consists of citations of computer science papers
Airline
(12GB)
CiteSeer dataset
East-West Trains
(Michalski, 1980; extended by Michie to 20 samples, 1994)
King-Rook vs. King
(Quinlan)
ILP 2005 Challenge database
(yeast gene from SwissProt, now offline)
HIV
(KDD 2001, from NCI - National Cancer Institute’s AIDS antiviral screen)
DSSTox
(Muggleton, 2006)
Youtube
UW-CSE describes an academic department
(Richardson and Domingos, 2006)
Gdelt
VOC
Proper datasets mostly in pl
Mammography (mammogram is benign or malignant [Davis, 2005])
Carcinogenesis/Carcinogenicity (not found, Muggleton, 2006, 337 chemicals)
Biodegradability
(Dzeroski, 1999)
Stanford Large Network Dataset Collection
Airlines
Amazon
Diterpenes dataset (not found, Džeroski, 1998)
National Football League (NFL)
Heart Disease
Protein dataset (SCOP)
Alzheimer’s disease
Drug-Data (pyrimidine + triazine)
Secondary structure prediction of Proteins
http://piktochart.com/6-useful-databases-to-dig-for-data/
http://meta.wikimedia.org/wiki/Data_dumps
http://data.worldbank.org/topic/environment
http://yann.lecun.com/exdb/mnist/
TPC-E benchmark
http://htsql.org/gallery/#donors-choose
(just a list - use Google to find the referenced datasets)
http://stat-computing.org/dataexpo/2009/
$yr.csv.bz2 (airline dataset, the task is to predict whether a flight will be delayed by more than 15 minutes)
http://www-ai.ijs.si/~ilpnet2/apps/
http://pgfoundry.org/projects/dbsamples/
https://github.com/AKSW/SML-Bench/tree/develop/learningtasks
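For the airline dataset listed above ($yr.csv.bz2), the stated task is to predict whether a flight will be delayed by more than 15 minutes. A minimal sketch of deriving that target label follows. It assumes the delay is stored in minutes in a column named `ArrDelay`; the source does not say whether arrival or departure delay is meant, so the column name is a parameter. The function name `label_delayed` and the inline sample rows are illustrative only, not real data.

```python
import csv
import io

# Threshold from the task description: "delayed by more than 15 minutes".
DELAY_THRESHOLD_MIN = 15


def label_delayed(rows, delay_column="ArrDelay"):
    """Yield (row, label) pairs.

    label is True when the delay exceeds the threshold, False when it does
    not, and None when the delay value is missing or not numeric.
    The default column name is an assumption about the file layout.
    """
    for row in rows:
        try:
            delay = float(row.get(delay_column, ""))
        except (TypeError, ValueError):
            # Missing column or a non-numeric marker such as "NA".
            yield row, None
            continue
        yield row, delay > DELAY_THRESHOLD_MIN


# Illustrative rows only; these are not taken from the real dataset.
SAMPLE = """Year,Month,DayofMonth,ArrDelay
2008,1,3,-14
2008,1,3,15
2008,1,3,34
2008,1,3,NA
"""

labels = [label for _, label in label_delayed(csv.DictReader(io.StringIO(SAMPLE)))]
print(labels)  # [False, False, True, None]
```

To run this against a downloaded file, the compressed CSV can be opened with `bz2.open(path, "rt")` from the standard library and passed to `csv.DictReader` in place of the in-memory sample. A delay of exactly 15 minutes is labeled False because the task says "more than".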
Time series datasets
Cover several industries (banking, insurance, telco, utility, manufacturing, FMCG) and several classification problems (PTD, PTB, PTC, …)
Requirements:
Real datasets, not artificial
Relational
Contains time
Publicly available
ČS Česká Spořitelna
CCS Česká společnost pro platební karty, PTC
1188
Telco, PTB, PTU, PTC
Medicare
Financial database, PTD, PTB
Netflix, PTC
MovieLens, PTC
Airline
(12GB)
Stack Overflow
Git
Stanford Large Network Dataset Collection
Gdelt
VOC
Airlines
Muggleton
https://www.kaggle.com/c/telstra-recruiting-network
https://www.kaggle.com/c/rossmann-store-sales
https://www.kaggle.com/c/avito-context-ad-clicks
https://www.kaggle.com/c/predict-west-nile-virus
https://www.kaggle.com/c/facebook-recruiting-iv-human-or-bot
https://www.kaggle.com/c/walmart-recruiting-sales-in-stormy-weather
https://www.kaggle.com/c/march-machine-learning-mania-2015
https://www.kaggle.com/c/kdd-cup-2014-predicting-excitement-at-donors-choose
https://www.kaggle.com/c/acquire-valued-shoppers-challenge
https://www.kaggle.com/c/risky-business
https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting
https://www.kaggle.com/c/pakdd-cup-2014
https://www.kaggle.com/c/genentech-flu-forecasting
https://www.kaggle.com/c/ams-2014-solar-energy-prediction-contest
https://www.kaggle.com/c/job-recommendation
https://www.kaggle.com/c/predict-wordpress-likes
http://www.kddcup2012.org/hhsgov/health-insurance-marketplace
Alchemy
Animals, 2D → not imported
CiteSeer
Cora
Epinions
IMDB
Kinships, labels are missing → not imported
Nations, matlab → imported
Protein Interaction
Radish Robot Mapping
Tutorial
UMLS
UW-CSE
WebKB, prolog
JSON-like data
Predictive Web Analytics Challenge
Seznam data
Avast
Ministerstvo vnitra (Czech Ministry of the Interior)