Return-Path:
X-Original-To: apmail-mahout-dev-archive@www.apache.org
Delivered-To: apmail-mahout-dev-archive@www.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 673F2D3BF for ; Mon, 22 Oct 2012 18:00:15 +0000 (UTC)
Received: (qmail 2688 invoked by uid 500); 22 Oct 2012 18:00:14 -0000
Delivered-To: apmail-mahout-dev-archive@mahout.apache.org
Received: (qmail 2597 invoked by uid 500); 22 Oct 2012 18:00:14 -0000
Mailing-List: contact dev-help@mahout.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: dev@mahout.apache.org
Delivered-To: mailing list dev@mahout.apache.org
Received: (qmail 2533 invoked by uid 99); 22 Oct 2012 18:00:14 -0000
Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 22 Oct 2012 18:00:13 +0000
Date: Mon, 22 Oct 2012 18:00:13 +0000 (UTC)
From: "Dmitriy Lyubimov (JIRA)"
To: dev@mahout.apache.org
Message-ID: <720630680.10747.1350928813998.JavaMail.jiratomcat@arcas>
In-Reply-To: <113010923.9883.1350915491985.JavaMail.jiratomcat@arcas>
Subject: [jira] [Comment Edited] (MAHOUT-1103) clusterpp is not writing directories for all clusters
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

    [ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481557#comment-13481557 ]

Dmitriy Lyubimov edited comment on MAHOUT-1103 at 10/22/12 6:00 PM:
--------------------------------------------------------------------

bq. Hi Dmitriy, sorry for going a little off topic here, but could you elaborate on this? I've been experimenting with using either cosine or tanimoto distance on the USigma output of ssvd with -pca true. Are those not appropriate distance measures for the -pca output?

I'll reply on the user list; that way somebody will surely try to correct me, since I have been doing straightforward LSA only.

      was (Author: dlyubimov):
    bq. Since it's not working for even two clusters, I don't see any problem due to the Partitioner. The input here looks like the output of SSVD. There have been problems reported earlier as well, where SSVD output was creating problems in clustering.

I'll reply on the user list; that way somebody will surely try to correct me, since I have been doing straightforward LSA only.

> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>   Affects Versions: 0.8
>            Reporter: Matt Molek
>            Assignee: Paritosh Ranjan
>              Labels: clusterpp
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2, only one cluster directory was created. For each reducer that fails to produce directories, there is an empty part-r-* file in the output directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat}
> The output of clusterdump shows two clusters, VL-3742464 and VL-3742466, containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by the default Hadoop hash partitioner. The hashes of these two clusters aren't identical, but they are close. Putting both cluster names into a Text and calling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
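
The partitioner hypothesis in the quoted report can be checked with a little arithmetic. The following is a minimal sketch, not code from Mahout or Hadoop. It assumes that Text hashes its bytes starting from 1 with a multiplier of 31, and that the default hash partitioner picks a reducer as (hashCode & Integer.MAX_VALUE) % numReduceTasks. Both formulas are re-implemented here from memory rather than called from Hadoop, and the class and method names (PartitionSketch, textHash, partition) are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone sketch; does not depend on Hadoop or Mahout.
class PartitionSketch {

    // Assumed replica of how Hadoop's Text hashes its UTF-8 bytes:
    // start at 1, then hash = 31 * hash + byte for each byte.
    static int textHash(String s) {
        int hash = 1;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hash = 31 * hash + b;
        }
        return hash;
    }

    // Assumed replica of the default hash partitioner's formula:
    // clear the sign bit, then take the remainder by the reducer count.
    static int partition(int hashCode, int numReduceTasks) {
        return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Cluster names taken from the report above; 2 reducers for the k=2 run.
        for (String name : new String[] {"VL-3742464", "VL-3742466"}) {
            int h = textHash(name);
            System.out.println(name + " -> hash " + h + ", reducer " + partition(h, 2));
        }
    }
}
```

Under these assumptions the sketch reproduces the two hash values in the report, and with two reducers both keys map to reducer 0, which would leave the other reducer with no input. That is consistent with the empty part-r-* file described above, but it only illustrates the hypothesis; the thread itself does not settle whether the partitioner is the cause.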