Date: Wed, 26 Oct 2011 17:15:01 -0400 (EDT)
From: confluence@apache.org
To: commits@mahout.apache.org
Reply-To: dev@mahout.apache.org
Message-ID: <30062523.25224.1319663701390.JavaMail.confluence@thor>
Subject: [CONF] Apache Mahout > Collections

Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Collections (https://cwiki.apache.org/confluence/display/MAHOUT/Collections)

Edited by Grant Ingersoll:
---------------------------------------------------------------------
TODO: Organize these somehow, add one-line blurbs. Organize by usage?
(classification, recommendation, etc.)

h2. Collections of Collections

- [ML Data|http://mldata.org/about/] ... repository supported by Pascal 2.
- [DBPedia|http://wiki.dbpedia.org/Downloads30]
- [UCI Machine Learning Repo|http://archive.ics.uci.edu/ml/]
- [http://mloss.org/community/blog/2008/sep/19/data-sources/]
- [Linked Library Data|http://ckan.net/group/lld] via CKAN
- [InfoChimps|http://infochimps.com/] - Free and purchasable datasets
- [http://www.linkedin.com/groupItem?view=&srchtype=discussedNews&gid=3638279&item=35736572&type=member&trk=EML_anet_ac_pst_ttle] - LinkedIn discussion of lots of data sets

h2. Categorization Data

- [20Newsgroups|http://people.csail.mit.edu/jrennie/20Newsgroups/]
- [RCV1 data set|http://jmlr.csail.mit.edu/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm]
- [10 years of CLEF Data|http://direct.dei.unipd.it/]
- http://ece.ut.ac.ir/DBRG/Hamshahri/ (approximately 160k categorized docs). A newer beta version is available at http://ece.ut.ac.ir/DBRG/Hamshahri/ham2/ (approximately 320k categorized docs).

h2. Recommendation Data

- [Netflix Prize/Dataset|http://www.netflixprize.com/download]
- [Book usage and recommendation data from the University of Huddersfield|http://library.hud.ac.uk/data/usagedata/]
- [Last.fm|http://denoiserthebetter.posterous.com/music-recommendation-datasets] - Non-commercial use only
- [Amazon Product Review Data via Jindal and Liu|http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html] - Scroll down the page to find the data

h2. Multilingual Data

- [http://urd.let.rug.nl/tiedeman/OPUS/OpenSubtitles.php] - 308,000 subtitle files covering about 18,900 movies in 59 languages (July 2006 numbers). Note: user uploads of copyrighted content.
- [Statistical Machine Translation|http://www.statmt.org/] - Devoted to all things language translation. Includes multilingual corpora of European and Canadian legal texts.

h2. Geospatial

- [Natural Earth Data|http://www.naturalearthdata.com/]
- [Open Street Maps|http://wiki.openstreetmap.org/wiki/Main_Page] and other crowd-sourced mapping data sites

h2. Airline

- [Open Flights|http://openflights.org/] - Crowd-sourced database of airlines, flights, airports, times, etc.
- [Airline on-time information - 1987-2008|http://stat-computing.org/dataexpo/2009/] - 120 million CSV records, 12 GB uncompressed

h2. General Resources

- [theinfo|http://theinfo.org/]
- [WordNet|http://wordnet.princeton.edu/obtain]

h2. Stuff

- [http://www.cs.technion.ac.il/~gabr/resources/data/ne_datasets.html]
- [4 Universities Data Set|http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/]
- [Large crawl of Twitter|http://an.kaist.ac.kr/traces/WWW2010.html]
- [UniProt|http://beta.uniprot.org/]
- [http://www.icwsm.org/2009/data/]
- http://data.gov
- http://www.ckan.net/
- http://www.guardian.co.uk/news/datablog/2010/jan/07/government-data-world
- http://data.gov.uk/