mahout-user mailing list archives

From Adil Aijaz <a...@yahoo-inc.com>
Subject syntheticcontroldata clustering example failure due to combiner
Date Wed, 10 Jun 2009 17:49:31 GMT
Hi folks,

I am new to mahout and I started exploring mahout 0.1 release by trying 
to run the kmeans clustering example as described in 
http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html

After a bunch of runs in which the output never changed no matter what 
parameters I specified, I realized that:

1. KMeans was clustering all 600 points of syntheticcontroldata into one 
cluster.

2. There is a bug in 
examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java 
that calls runJob from the main function with my provided arguments 
transposed: my convergenceDelta was interpreted as t1, t1 as t2, and 
t2 as convergenceDelta. I will commit a patch as soon as I get approval 
for open-source commits from my employer; in the meantime, I thought I'd 
put it out there in case someone else is running into the same issue.
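
A minimal sketch of how a transposition like the one in item 2 can slip past the compiler: when every parameter is a double, any ordering type-checks. The class, the method names, and the three-parameter runJob signature below are hypothetical stand-ins for illustration, not the actual Mahout code.

```java
public class ArgOrderSketch {
    /** Holds the values exactly as the (hypothetical) job received them. */
    static final class Params {
        final double t1;
        final double t2;
        final double convergenceDelta;

        Params(double t1, double t2, double convergenceDelta) {
            this.t1 = t1;
            this.t2 = t2;
            this.convergenceDelta = convergenceDelta;
        }

        @Override
        public String toString() {
            return "t1=" + t1 + " t2=" + t2 + " convergenceDelta=" + convergenceDelta;
        }
    }

    // Hypothetical stand-in for runJob: the declared order is t1, t2, convergenceDelta.
    static Params runJob(double t1, double t2, double convergenceDelta) {
        return new Params(t1, t2, convergenceDelta);
    }

    // Transposed call: convergenceDelta lands in t1, t1 in t2, t2 in convergenceDelta.
    // All three arguments are doubles, so this compiles without complaint.
    static Params buggyCall(double t1, double t2, double convergenceDelta) {
        return runJob(convergenceDelta, t1, t2);
    }

    // Corrected call: arguments are passed in the declared order.
    static Params fixedCall(double t1, double t2, double convergenceDelta) {
        return runJob(t1, t2, convergenceDelta);
    }

    public static void main(String[] args) {
        System.out.println("buggy: " + buggyCall(80.0, 55.0, 0.5));
        System.out.println("fixed: " + fixedCall(80.0, 55.0, 0.5));
    }
}
```

With the transposed call, a small convergenceDelta ends up as t1 and the user's t2 ends up as the convergence threshold, which would be consistent with output that does not respond sensibly to the parameters supplied.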

As for the more serious issue #1 (kmeans clustering everything into one 
cluster), I found that this is because the CanopyClusteringJob was 
generating only one canopy. Digging deeper, I found that this problem 
was coming from the CanopyCombiner being run in both the map and reduce 
phases. From there I discovered this post from December 2008:

http://tinyurl.com/l83ff4

which indicates that from hadoop 0.18 onwards the combiner can be run 
on both the map and reduce sides. That is bad, since the CanopyCombiner 
and KMeansCombiner assume that they are executed only on the map side. 
The suggested workaround is specific to hadoop 0.18, and it doesn't work 
with mahout-0.1 since mahout-0.1 requires hadoop 0.19. This means a code 
fix is needed for this issue. In that thread Grant talks about a patch 
(MAHOUT-99) that fixes the code, but that patch is already part of 
mahout-0.1, so it apparently does not fix the issue.
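
As a general illustration of what "safe to run on both sides" means for a combiner, here is a minimal plain-Java sketch with no Hadoop dependency. It is not the Mahout CanopyCombiner or KMeansCombiner code and not necessarily how the issue was or should be fixed; the class and method names are hypothetical. The idea it shows is that a combiner whose output has the same shape as its input (here a partial sum plus a count) can be applied zero, one, or several times without changing the final result.

```java
import java.util.Arrays;
import java.util.List;

public class RepeatableCombinerSketch {
    /** Partial centroid: a running sum of points plus the number of points it covers. */
    static final class Partial {
        final double[] sum;
        final long count;

        Partial(double[] sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        /** Wraps a single raw point as a partial covering one point. */
        static Partial ofPoint(double[] point) {
            return new Partial(point.clone(), 1);
        }

        /** Final mean; only the last stage (the reducer) should call this. */
        double[] centroid() {
            double[] c = new double[sum.length];
            for (int i = 0; i < sum.length; i++) {
                c[i] = sum[i] / count;
            }
            return c;
        }
    }

    // Input and output are both Partial, so the result can be fed back in:
    // combining already-combined values gives the same totals.
    static Partial combine(List<Partial> parts) {
        double[] sum = new double[parts.get(0).sum.length];
        long count = 0;
        for (Partial p : parts) {
            for (int i = 0; i < sum.length; i++) {
                sum[i] += p.sum[i];
            }
            count += p.count;
        }
        return new Partial(sum, count);
    }

    public static void main(String[] args) {
        List<Partial> points = Arrays.asList(
            Partial.ofPoint(new double[] {1.0, 2.0}),
            Partial.ofPoint(new double[] {3.0, 4.0}),
            Partial.ofPoint(new double[] {5.0, 6.0}));

        // Combined once over all points.
        Partial once = combine(points);

        // Combined in two stages, as if a combiner ran on the map side
        // and then again on the reduce side.
        Partial mapSide = combine(points.subList(0, 2));
        Partial twice = combine(Arrays.asList(mapSide, points.get(2)));

        System.out.println(Arrays.toString(once.centroid()));  // [3.0, 4.0]
        System.out.println(Arrays.toString(twice.centroid())); // [3.0, 4.0]
    }
}
```

A combiner that instead emits something its own input format cannot represent, or that assumes each input value is a single raw point, would give different results depending on how many times the framework chose to run it.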

All that to say, I haven't been able to get the kmeans clustering 
example on syntheticcontroldata to work, which is a bummer. My questions are:

1) Are there any open jiras on this issue (I didn't find any)? If not, 
should I create one?
2) Any workarounds for now?


Adil
