mahout-user mailing list archives

From Pat Ferrel <>
Subject Re: Clustering in Mahout 0.9 candidate
Date Fri, 24 Jan 2014 22:24:11 GMT
I have a setup using hadoop M/R kmeans for testing. If I can help in any way let me know and
if you don’t get to it I’ll have a look this weekend.


On Jan 24, 2014, at 1:56 PM, Suneel Marthi <> wrote:


Andrew's not filed a JIRA for this, so thanks for filing M-1410 to track this.

The fix would be to modify ClusterIterator.iterateSeq() - (for the Sequential mode) to read
the vector key along with the vector.

For the MR mode, needs to be modified to read the vector key along with the vector.

The aforementioned fixes should take care of both KMeans and Fuzzy KMeans clustering.
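
To illustrate the idea behind the fix described above (iterating over each key together with its vector, so that every cluster assignment stays tied to its input row), here is a minimal, self-contained sketch in plain Java. It is not Mahout code and does not use ClusterIterator or any Hadoop classes; the class name KeyedClusterAssigner, its methods, and the "row-N" keys are hypothetical and exist only for illustration. The same loop is roughly what a hand-written cluster categorizer would look like as a workaround.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: not part of Mahout.
public class KeyedClusterAssigner {

    /** Squared Euclidean distance between two equal-length vectors. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the centroid closest to the given vector. */
    static int nearestCluster(double[] vector, double[][] centroids) {
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double d = squaredDistance(vector, centroids[c]);
            if (d < bestDistance) {
                bestDistance = d;
                best = c;
            }
        }
        return best;
    }

    /**
     * Iterates over (key, vector) entries rather than over the vectors alone,
     * so each assignment in the result is keyed by the original input key.
     */
    static Map<String, Integer> assign(Map<String, double[]> keyedVectors,
                                       double[][] centroids) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> e : keyedVectors.entrySet()) {
            result.put(e.getKey(), nearestCluster(e.getValue(), centroids));
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample data: unnamed vectors identified only by their keys.
        Map<String, double[]> input = new LinkedHashMap<>();
        input.put("row-0", new double[] {0.1, 0.2});
        input.put("row-1", new double[] {9.8, 10.1});
        double[][] centroids = { {0.0, 0.0}, {10.0, 10.0} };
        System.out.println(assign(input, centroids));
    }
}
```

If the loop iterated over the vectors alone and dropped the key, the output would contain cluster ids with nothing to join them back to the input, which is the problem reported in this thread.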

I can work on a patch later today (should have something out by tonight).

On Friday, January 24, 2014 4:47 PM, Pat Ferrel <> wrote:

Yeah, it’s not really the issue with M-1030 but makes the fix unusable. I apologize for
not noticing this sooner, my own fault I guess.

Did you file a JIRA against the larger issue? Any ETA on a fix (0.9?). Should I go ahead and
write my own cluster categorizer?

You and Suneel pointed to the problem area but I’m not sure I know the code well enough
to patch it myself. I’m building the 1.0-snapshot so if you have a suggestion I’d be happy
to try it out. I’m sort of blocked on some kind of fix for it.


On Jan 24, 2014, at 10:46 AM, Andrew Musselman <> wrote:

That's correct; I reported that last summer and didn't fix it in M-1030
since it didn't seem like that's what the group wanted in that bug.

I see you're filing another bug, thanks.

On Fri, Jan 24, 2014 at 10:29 AM, Pat Ferrel <> wrote:

> I can’t believe I haven’t noticed this before and so am hoping I’m
> mistaken…
> When you are using kmeans to cluster data where there is no “named”
> vector, clusteredPoints do not contain the vector ids so the cluster id,
> pdf, “distance-squared”, and vector dimensions are not tied to any known
> vector and so are, well, pretty much useless afaict.
> This means you have to loop through all your input vectors, recalculate
> any of the above values you need and categorize them yourself, right? Is
> this how it’s meant to work?
> I have used clustering before but had named vectors (text docs). Anyone
> clustering some intermediate Mahout DRM or vectors with no names will have
> this problem.
> Someone please tell me I’ve slipped a gear...
