mahout-dev mailing list archives

From "Paritosh Ranjan (JIRA)" <>
Subject [jira] [Commented] (MAHOUT-1103) clusterpp is not writing directories for all clusters
Date Mon, 22 Oct 2012 14:58:12 GMT


Paritosh Ranjan commented on MAHOUT-1103:

"I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2"

Since it's not working even for two clusters, I don't see any problem due to the Partitioner.
The input here looks like the output of SSVD. Problems have been reported earlier as well,
where SSVD output caused trouble in clustering.

Can you try kmeans + clusterpp without performing SSVD on the vectors? I suspect this is
the problem for now, but we will have to test it.

The sequential and mapreduce versions are completely different implementations, so it's normal
to have a bug in one version that is not present in the other.

Please update once you test it.
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>                 Key: MAHOUT-1103
>                 URL:
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>              Labels: clusterpp
> After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories
for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that fails to
produce directories there is an empty part-r-* file in the output directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat}
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843
and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by the default
Hadoop hash partitioner. The hashes of these two clusters aren't identical, but they are close.
Putting both cluster names into a Text and calling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.
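
For anyone who wants to check the hash arithmetic quoted above, here is a minimal sketch in Python. It assumes that Hadoop's Text.hashCode() folds the UTF-8 bytes as hash = 31 * hash + byte starting from 1, and that the default HashPartitioner assigns (hashCode & Integer.MAX_VALUE) % numReduceTasks. The function names text_hash and partition are hypothetical helpers, not Mahout or Hadoop APIs. Whether clusterpp actually keys its reducer input on these Text names, and how many reducers the job ran with, are also assumptions.

```python
def text_hash(name: str) -> int:
    """Emulate the assumed Text.hashCode(): 31-based fold over UTF-8 bytes, 32-bit signed."""
    h = 1
    for b in name.encode("utf-8"):
        if b > 127:              # Java bytes are signed
            b -= 256
        h = (31 * h + b) & 0xFFFFFFFF   # keep the low 32 bits, like Java int overflow
    return h - (1 << 32) if h >= (1 << 31) else h


def partition(hash_code: int, num_reducers: int) -> int:
    """Emulate the assumed default HashPartitioner formula."""
    return (hash_code & 0x7FFFFFFF) % num_reducers


if __name__ == "__main__":
    for name in ("VL-3742464", "VL-3742466"):
        h = text_hash(name)
        print(name, h, "-> partition", partition(h, 2), "of 2")
```

Under these assumptions the sketch reproduces the two hash values reported in the issue. Because they differ by exactly 2, both names land in the same partition when there are two reducers, and in different partitions for any larger reducer count. This does not by itself establish the cause of the missing directories; it only makes the numbers easy to re-check.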
