mahout-commits mailing list archives

Subject [CONF] Apache Mahout > Collections
Date Thu, 30 Dec 2010 13:39:00 GMT
Space: Apache Mahout
Page: Collections

Edited by Grant Ingersoll:
TODO: Organize these somehow, add one-line blurbs
Organize by usage? (classification, recommendation etc.)

*Collections of Collections*
[ML Data|] ... repository supported by Pascal 2.
[UCI Machine Learning Repo|]
[InfoChimps|] Free and purchasable datasets
LinkedIn discussion of lots of data sets

*Categorization Data*
[RCV1 data set|]
[10 years of CLEF Data|] (Approximately 160k categorized docs)
There is a newer beta version here: (Approximately 320k categorized docs)

*Recommendation Data*
[Netflix Prize/Dataset|]
[Book usage and recommendation data from the University of Huddersfield|]
[|] - Non-commercial use only

*Multilingual Data*
[] - 308,000 subtitle files covering
about 18,900 movies in 59 languages (July 2006 numbers)
Note: includes user uploads of copyrighted content.

[Statistical Machine Translation|] - devoted to all things related to language
translation. Includes multilingual corpora of European and Canadian legal texts.

[Natural Earth Data|]
[Open Street Maps|]
And other crowd-sourced mapping data sites.

*General Resources*


[4 Universities Data Set|]

[Large crawl of Twitter|]



[Airline on-time information - 1987-2008|] - 120 million
CSV records, 12 GB uncompressed
