mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: syntheticcontroldata clustering example failure due to combiner
Date Wed, 10 Jun 2009 23:30:15 GMT
Synthetic Control actually used to work with all the clustering jobs. 
The move to Hadoop 0.19 introduced intermittent problems that depend 
upon optimizations done behind the scenes in Hadoop. All of the original 
implementations used combiners under the assumption that they would run 
only after the mapper and exactly once. Those assumptions stopped 
holding in 0.19. MAHOUT-99 fixed K-Means but not Canopy or Mean Shift, 
which still rely on them.
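
To illustrate why the "exactly once" assumption matters, here is a minimal 
sketch in plain Java with no Hadoop dependency. It is not the Canopy or 
Mean Shift combiner code, which is not shown in this thread; the class and 
method names are hypothetical. It only shows the general pattern: a 
combine step whose output is a different kind of value than its input (a 
mean of raw values) gives a wrong answer as soon as its output is combined 
again, while a step that consumes and produces (sum, count) pairs can run 
any number of times.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration; none of these names come from Mahout or Hadoop. */
class CombinerRerunSketch {

    /** Unsafe combine step: raw values in, a single mean out. */
    static double mean(List<Double> values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    /** Safe combine step: (sum, count) pairs in, one (sum, count) pair out. */
    static double[] mergeSumCount(List<double[]> partials) {
        double sum = 0;
        double count = 0;
        for (double[] p : partials) {
            sum += p[0];
            count += p[1];
        }
        return new double[] {sum, count};
    }

    /** Mean of {2, 4, 9} when {2, 4} is pre-averaged and then averaged again with 9. */
    static double unsafeResult() {
        double firstPass = mean(Arrays.asList(2.0, 4.0));
        // The second pass treats the mean 3.0 as if it were a single raw value.
        return mean(Arrays.asList(firstPass, 9.0));
    }

    /** Same data, but every pass merges (sum, count) pairs, so extra passes are harmless. */
    static double safeResult() {
        double[] firstPass = mergeSumCount(Arrays.asList(
                new double[] {2.0, 1}, new double[] {4.0, 1}));
        double[] secondPass = mergeSumCount(Arrays.asList(firstPass, new double[] {9.0, 1}));
        // A redundant third pass over already-merged output changes nothing.
        double[] thirdPass = mergeSumCount(Arrays.asList(secondPass));
        return thirdPass[0] / thirdPass[1];
    }
}
```

The true mean of 2, 4 and 9 is 5; safeResult() returns 5.0 while 
unsafeResult() returns 6.0.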

Unfortunately, in the development mode used by the build and all the 
unit tests, the combiner seems to run only once and only with the 
mappers. As a result, the severity of the semantics change went 
undetected until recently, when users started running clustering on 
real Hadoop clusters.

The only solution I can imagine right now is to move the combiner 
centroid summation code back into the mappers and have the mappers 
output fully combined data during close(). It is not very elegant; 
perhaps someone has a better solution in mind. I will take a look at it 
tonight after the Hadoop Summit.
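
A rough sketch of that idea, in plain Java with no Hadoop dependency so it 
can be read on its own. All names here are hypothetical and this is not 
the eventual Mahout change: map() only accumulates per-cluster sums and 
counts in memory, and close() hands back one fully combined record per 
cluster. In a real mapper, close() would write those records to the output 
collector instead of returning them.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for a mapper that does its own centroid summation. */
class InMapperCombiningSketch {

    /** Partial result for one cluster: component-wise sum plus point count. */
    static final class PartialSum {
        final double[] sum;
        long count;

        PartialSum(int dims) {
            this.sum = new double[dims];
        }

        void add(double[] point) {
            for (int i = 0; i < sum.length; i++) {
                sum[i] += point[i];
            }
            count++;
        }
    }

    private final double[][] centers;
    private final Map<Integer, PartialSum> partials = new TreeMap<>();

    InMapperCombiningSketch(double[][] centers) {
        this.centers = centers;
    }

    /** Called once per input point: find the nearest center and accumulate; emit nothing. */
    void map(double[] point) {
        int nearest = 0;
        double best = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0;
            for (int i = 0; i < point.length; i++) {
                double diff = point[i] - centers[c][i];
                dist += diff * diff; // squared Euclidean distance
            }
            if (dist < best) {
                best = dist;
                nearest = c;
            }
        }
        partials.computeIfAbsent(nearest, k -> new PartialSum(point.length)).add(point);
    }

    /** Called once at the end of the input split: one combined record per cluster. */
    Map<Integer, PartialSum> close() {
        return partials;
    }
}
```

The trade-off is that the mapper holds one partial sum per cluster in 
memory until close(), but the result no longer depends on how many times, 
or where, Hadoop decides to run a combiner.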

Jeff

Adil Aijaz wrote:
> Hi folks,
>
> I am new to mahout and I started exploring mahout 0.1 release by 
> trying to run the kmeans clustering example as described in 
> http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html
>
> After a bunch of runs where, no matter what parameters I specified, 
> the output never changed, I realized that:
>
> 1. KMeans was clustering all 600 points of syntheticcontroldata into 
> one cluster.
>
> 2. There is a bug in 
> examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java
> that called runJob from main function with my provided arguments 
> transposed. So, my convergenceDelta was interpreted as t1, t1 as t2, 
> and t2 as convergenceDelta. I will commit a patch as soon as I get 
> approval for opensource commits from my employer, however, I thought 
> I'd put it out there in case someone else is going through the same 
> issue.
>
> As for the more serious issue #1 (kmeans clustering everything into one 
> cluster), I found that this is because the CanopyClusteringJob was 
> generating only one canopy. Digging deeper, I found that this problem 
> was coming from the CanopyCombiner being run in both map & reduce 
> phases. From there I discovered this post from December 2008:
>
> http://tinyurl.com/l83ff4
>
> which indicates that from Hadoop 0.18 onwards the combiner will be run 
> in both map and reduce, which is bad since the CanopyCombiner and 
> KMeansCombiner assume that they are executed only on the map side. Now, 
> the suggested workaround is specific to Hadoop 0.18 and it doesn't 
> work with mahout-0.1, since mahout-0.1 requires Hadoop 0.19. This means 
> a code fix is needed for this issue. In that thread Grant talks about a 
> patch (MAHOUT-99) that fixes the code, but that patch is already part 
> of mahout-0.1 and so it apparently does not fix the issue.
>
> All that to say, I haven't been able to get the kmeans clustering 
> example on syntheticdata to work which is a bummer. My questions are:
>
> 1) Are there any open jiras on this issue (I didn't find any) ? If no, 
> should I create one?
> 2) Any workarounds for now?
>
>
> Adil
>
>

