mahout-user mailing list archives

From Karl Wettin <karl.wet...@gmail.com>
Subject Re: Clustering Demo
Date Thu, 08 May 2008 15:56:29 GMT
Grant Ingersoll wrote:
> Anyone have any sample code or demo of running the clustering over a 
> large collection of documents that they could share?  Mainly looking for 
> an example of taking some corpus, converting it into the appropriate 
> Mahout representation and then running either the k-means or the canopy 
> clustering on it.

There is the rule-based data set generation in MAHOUT-43.

http://www.datasetgenerator.com

Push a few buttons and you have an insane amount of OK test data 
according to your specifications. That is what I have been using.


I also have a contact with some guys who produce news
article data for indexing. The data is nicely organized, and they have
previously offered to look into giving committers access to it for
local tests.

I have a number of data sets whose ownership I'm not certain about. For
instance, I've been gathering real estate data for Sweden for some time,
as the sites I was using to find an apartment did not work the way I
wanted them to :)



           karl
