mahout-dev mailing list archives

From "Jeff Eastman (JIRA)"
Subject [jira] Commented: (MAHOUT-54) parallelize k-means sharing the predominance of canopies
Date Sun, 11 May 2008 22:07:56 GMT


Jeff Eastman commented on MAHOUT-54:

I downloaded this patch and it installed cleanly, but I have several concerns about it:

- the patch introduces an entirely new canopykmeans package without much motivation. In particular,
it is not clear what improvements it is suggesting for either canopy or kmeans
- there are no unit tests included that would indicate that the code produces correct results
- the formatting does not follow the rules specified by ASF: the Java conventions, with tabs
replaced by 2 spaces rather than 4. The patch also needlessly reformats several of the
existing canopy files
- the patch introduces @author tags, which are contrary to ASF policy. These were likely
added by Eclipse but should be removed

I would prefer to understand the logic changes which are being suggested first, then see a
minimal patch to introduce such changes. This patch introduces an entirely new implementation
that is derived from the original version, but cannot be easily compared with it. And, it
has no associated tests.

I'm interested in understanding if logic improvements to either canopy or kmeans can be made,
but from this patch it is too difficult to understand what is being proposed. Could you please
try to be a little more systematic?

> parallelize k-means sharing the predominance of canopies
> --------------------------------------------------------
>                 Key: MAHOUT-54
>                 URL:
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.1
>         Environment: OS Independent
>            Reporter: Jeremy Chow
>             Fix For: 0.1
>         Attachments: canopykeams.patch
> The current Mahout implementation uses the canopy algorithm only to create the initial
cluster centroids for k-means. While iterating, it calculates the distance from each center to
every point. But the most important improvement canopies offer is that distances need to be
calculated only from each center to a much smaller number of points, those which exist in the same
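
The idea in the issue description can be illustrated with a small sketch of a k-means assignment step that skips any center not sharing a canopy with the point. This is not the code from the attached patch and does not use Mahout's classes; the class name CanopyAssignSketch, the boolean membership matrices, and the distanceComputations counter are all hypothetical, chosen only to make the saving visible.

```java
import java.util.Arrays;

/** Hypothetical sketch: k-means assignment restricted to shared canopies. */
class CanopyAssignSketch {

  /** Distance computations performed by the most recent call to assign(). */
  static long distanceComputations = 0;

  /** Squared Euclidean distance between two points of equal dimension. */
  static double squaredDistance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /** True if the two membership rows have at least one canopy in common. */
  static boolean shareCanopy(boolean[] a, boolean[] b) {
    for (int c = 0; c < a.length; c++) {
      if (a[c] && b[c]) {
        return true;
      }
    }
    return false;
  }

  /**
   * Assigns each point to the nearest center that shares a canopy with it.
   * pointInCanopy[p][c] and centerInCanopy[k][c] are assumed to have been
   * computed beforehand by a canopy pass. A point that shares no canopy
   * with any center is assigned -1.
   */
  static int[] assign(double[][] points, boolean[][] pointInCanopy,
                      double[][] centers, boolean[][] centerInCanopy) {
    distanceComputations = 0;
    int[] assignment = new int[points.length];
    for (int p = 0; p < points.length; p++) {
      int best = -1;
      double bestDist = Double.POSITIVE_INFINITY;
      for (int k = 0; k < centers.length; k++) {
        // Skip the distance computation when no canopy is shared.
        if (!shareCanopy(pointInCanopy[p], centerInCanopy[k])) {
          continue;
        }
        double d = squaredDistance(points[p], centers[k]);
        distanceComputations++;
        if (d < bestDist) {
          bestDist = d;
          best = k;
        }
      }
      assignment[p] = best;
    }
    return assignment;
  }

  public static void main(String[] args) {
    double[][] points = {{0, 0}, {1, 0}, {10, 10}, {11, 10}};
    double[][] centers = {{0.5, 0}, {10.5, 10}};
    // Two canopies: the first two points and center 0 fall in canopy 0,
    // the last two points and center 1 fall in canopy 1.
    boolean[][] pointInCanopy = {
        {true, false}, {true, false}, {false, true}, {false, true}};
    boolean[][] centerInCanopy = {{true, false}, {false, true}};

    int[] assignment = assign(points, pointInCanopy, centers, centerInCanopy);
    System.out.println(Arrays.toString(assignment)); // [0, 0, 1, 1]
    System.out.println(distanceComputations);        // 4, versus 8 without canopies
  }
}
```

In this toy input each point shares a canopy with exactly one center, so four distances are computed instead of the eight that a plain assignment step over 4 points and 2 centers would need.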

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
