mahout-commits mailing list archives

From conflue...@apache.org
Subject [CONF] Apache Mahout > Collections
Date Thu, 16 Dec 2010 17:44:00 GMT
Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Collections (https://cwiki.apache.org/confluence/display/MAHOUT/Collections)


Edited by Ted Dunning:
---------------------------------------------------------------------
TODO: Organize these somehow, add one-line blurbs
Organize by usage? (classification, recommendation etc.)

*Collections of Collections*
[ML Data|http://mldata.org/about/] ... repository supported by Pascal 2.
[DBPedia|http://wiki.dbpedia.org/Downloads30]
[UCI Machine Learning Repo|http://archive.ics.uci.edu/ml/]
[http://mloss.org/community/blog/2008/sep/19/data-sources/]
[InfoChimps|http://infochimps.com/] Free and purchasable datasets

*Categorization Data*
[20Newsgroups|http://people.csail.mit.edu/jrennie/20Newsgroups/]
[RCV1 data set|http://jmlr.csail.mit.edu/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm]
[10 years of CLEF Data|http://direct.dei.unipd.it/]

http://ece.ut.ac.ir/DBRG/Hamshahri/ (Approximately 160k categorized docs)
There is a newer beta version here:
http://ece.ut.ac.ir/DBRG/Hamshahri/ham2/ (Approximately 320k categorized docs)

*Recommendation Data*
[Netflix Prize/Dataset|http://www.netflixprize.com/download]
[Book usage and recommendation data from the University of Huddersfield|http://library.hud.ac.uk/data/usagedata/]
[Last.fm|http://denoiserthebetter.posterous.com/music-recommendation-datasets] - Non-commercial
use only

*Multilingual Data*
[http://urd.let.rug.nl/tiedeman/OPUS/OpenSubtitles.php] - 308,000 subtitle files covering
about 18,900 movies in 59 languages (July 2006 numbers)
Note: the subtitle files are user uploads of copyrighted content.

[Statistical Machine Translation|http://www.statmt.org/] - devoted to all things language
translation. Includes multilingual corpora of European and Canadian legal texts.

*Geospatial*
[Natural Earth Data|http://www.naturalearthdata.com/]
[Open Street Maps|http://wiki.openstreetmap.org/wiki/Main_Page]
And other crowd-sourced mapping data sites.

*General Resources*
[theinfo|http://theinfo.org/]
[WordNet|http://wordnet.princeton.edu/obtain]

*Stuff*
[http://www.cs.technion.ac.il/~gabr/resources/data/ne_datasets.html]

[4 Universities Data Set|http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/]

[Large crawl of Twitter|http://an.kaist.ac.kr/traces/WWW2010.html]

[UniProt|http://beta.uniprot.org/]

[http://www.icwsm.org/2009/data/]

http://data.gov
http://www.ckan.net/
http://www.guardian.co.uk/news/datablog/2010/jan/07/government-data-world
http://data.gov.uk/

