mahout-dev mailing list archives

From "Richard Tomsett (JIRA)" <>
Subject [jira] Commented: (MAHOUT-59) Create some examples of clustering well-known datasets
Date Thu, 19 Feb 2009 12:34:02 GMT


Richard Tomsett commented on MAHOUT-59:

Re: discussion of text clustering on the mailing list, there are several 'bag of words' examples
at the UCI repository: . The data is in
[docID wordID wordcount] format, so it needs to be converted into TF-IDF vectors for clustering.
I previously did this with a Python script, but I'll write something in Hadoop to do it before
passing the vectors on to Canopy or K-Means clustering. It may take a little while, as I haven't
looked at my code for about half a year, and I didn't write unit tests or anything last time...
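
For illustration only, here is a minimal single-machine sketch of the conversion described above. It is not the Hadoop job, and not the earlier script. It assumes every input line holds exactly three whitespace-separated integers (docID wordID wordcount) and uses one common TF-IDF variant, raw count times log(N / document frequency). The names parse_triples and tfidf_vectors are hypothetical.

```python
import math
from collections import defaultdict


def parse_triples(lines):
    """Yield (doc_id, word_id, count) from 'docID wordID wordcount' lines.

    Assumption: each non-blank line has exactly three integer fields.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc_id, word_id, count = (int(field) for field in line.split())
        yield doc_id, word_id, count


def tfidf_vectors(triples):
    """Return {doc_id: {word_id: weight}} as sparse TF-IDF vectors.

    Weighting (an assumption, one of several common variants):
    weight = count * log(n_docs / doc_freq).
    """
    docs = defaultdict(dict)     # doc_id -> {word_id: count}
    doc_freq = defaultdict(int)  # word_id -> number of docs containing it
    for doc_id, word_id, count in triples:
        if word_id not in docs[doc_id]:
            doc_freq[word_id] += 1
        docs[doc_id][word_id] = docs[doc_id].get(word_id, 0) + count

    n_docs = len(docs)
    return {
        doc_id: {
            word_id: count * math.log(n_docs / doc_freq[word_id])
            for word_id, count in counts.items()
        }
        for doc_id, counts in docs.items()
    }
```

Note that under this weighting a word that appears in every document gets weight zero, so it contributes nothing to the distance between documents.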

This would also involve writing a cosine distance measure class, which I guess would be useful
generally. Would this be a useful example?
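
As a rough sketch of what such a measure computes (plain Python over sparse dict vectors, not a Mahout class and not tied to any Mahout interface): cosine distance is taken here as 1 minus the cosine similarity. The function name cosine_distance and the convention of returning 1.0 when either vector has zero length are assumptions made for this example.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two sparse vectors given as {index: value} dicts.

    Defined here as 1 - (a . b) / (|a| * |b|).
    Assumption: if either vector has zero norm, return 1.0 (maximally distant).
    """
    # Only indices present in both vectors contribute to the dot product.
    dot = sum(value * b[index] for index, value in a.items() if index in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)
```

Because the measure depends only on the angle between vectors, two documents with the same word proportions but different lengths come out at distance zero, which is usually what is wanted for text.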

> Create some examples of clustering well-known datasets
> ------------------------------------------------------
>                 Key: MAHOUT-59
>                 URL:
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Jeff Eastman
>         Attachments: MAHOUT-59.patch
> The existing unit tests for clustering need to be augmented with examples from the literature
> which illustrate its correct operation on datasets which have known clusters present. See for some candidate datasets.

