mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Validating clustering output
Date Wed, 17 Jun 2009 13:14:06 GMT

On Jun 16, 2009, at 11:43 PM, Shashikant Kore wrote:

> I had hacked the code to put labels for the vectors.

OK, so we've put a lot of this in place now with MAHOUT-65.

> Then I modified
> KMeans to output the document label, Cluster ID, and distance from the
> cluster.

Do you think there is a way to make this generic for all of the
clustering jobs?  It seems like this would be handy to have in the new
Utils module I'm working on for MAHOUT-126 (committing today).

Care to throw up a patch as a starting point like you did for  
MAHOUT-126?
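
To make the idea concrete, here is a rough sketch of what a job-agnostic
version might look like. It assumes each clustering job can hand back a
list of final centroids, and it uses plain Euclidean distance; both are
assumptions for illustration. The class and method names
(ClusterAssignments, Assignment, assign) are hypothetical and are not
existing Mahout APIs, and nearest-centroid assignment is only an
approximation for the soft clustering jobs.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Mahout: labels each vector with its nearest cluster. */
public final class ClusterAssignments {

  /** One output row: document label, cluster ID, distance to that cluster's centroid. */
  public static final class Assignment {
    public final String label;
    public final int clusterId;
    public final double distance;

    Assignment(String label, int clusterId, double distance) {
      this.label = label;
      this.clusterId = clusterId;
      this.distance = distance;
    }

    @Override
    public String toString() {
      // Tab-separated so the output is easy to feed into other tools.
      return label + "\t" + clusterId + "\t" + distance;
    }
  }

  private ClusterAssignments() {}

  /** Plain Euclidean distance; an assumption, any distance measure could be swapped in. */
  static double euclidean(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  /**
   * Assigns each labeled vector to its nearest centroid. Only the final centroids
   * are needed, so this does not care which clustering job produced them.
   */
  public static List<Assignment> assign(
      List<String> labels, List<double[]> vectors, List<double[]> centroids) {
    if (labels.size() != vectors.size()) {
      throw new IllegalArgumentException("labels and vectors must have the same size");
    }
    if (centroids.isEmpty()) {
      throw new IllegalArgumentException("at least one centroid is required");
    }
    List<Assignment> out = new ArrayList<>();
    for (int v = 0; v < vectors.size(); v++) {
      int best = -1;
      double bestDist = Double.POSITIVE_INFINITY;
      for (int c = 0; c < centroids.size(); c++) {
        double d = euclidean(vectors.get(v), centroids.get(c));
        if (d < bestDist) {
          bestDist = d;
          best = c;
        }
      }
      out.add(new Assignment(labels.get(v), best, bestDist));
    }
    return out;
  }
}
```

Printing each Assignment gives one "label, cluster ID, distance" row per
document, which is roughly the output described above and should be
enough to drive the manual spot checks.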

> Another utility takes this input and converts labels to the
> actual text files from which it is created.   Then I do random checks
> manually for the documents in a cluster.
>

OK, so ad hoc.  Definitely a reasonable thing to do at this point.

I wonder if we could hook into the Carrot2 visualization tools at all.
They have some really nice tools, and perhaps we can output our stuff
in a format that works for them.  I imagine Weka does too.  I suppose
this all gets back to supporting more common input/output formats,
although the JSON (GSON) stuff seems pretty powerful that way too.

-Grant
