Below are links to some datasets that I collected from the web and from papers. Feel free to play with them!

In most machine learning tasks, data comes from one of three sources: it is provided by an organization (public or private), generated by oneself, or crawled from web pages.

Here I list some websites that host public datasets for others to explore.

Within Programming Languages

R: data()

Python: scikit-learn
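
As a minimal sketch of the Python route, assuming scikit-learn is installed: the library bundles a few small datasets that load without any download. The iris dataset used here is only an illustrative choice of mine, not one the list singles out.

```python
from sklearn.datasets import load_iris

# load_iris reads a copy of the data that ships with scikit-learn,
# so no network access is needed.
iris = load_iris()

X = iris.data    # feature matrix, one row per sample
y = iris.target  # integer class label for each sample

print(X.shape)                  # (n_samples, n_features)
print(list(iris.target_names))  # human-readable class names
```

The other `load_*` functions in `sklearn.datasets` follow the same pattern, while the `fetch_*` functions download larger datasets on first use.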

Dataset Repositories

Kaggle

UCI: start with the Most Popular Data Sets (recommended)

Online Recognition

IAM On-Line Handwriting Database

Image Recognition

MNIST or Kaggle

CIFAR-10 or Kaggle

ImageNet

Text Generation/Classification

RCV1 or here

20-newsgroups or here

Machine Translation

WMT’14

Machine QA

See this paper’s appendix for data generation.