From: gabeweb
To: mahout-user@lucene.apache.org
Date: Thu, 10 Feb 2011 01:42:44 -0800 (PST)
Subject: Problem in distributed canopy clustering

Hi,

I think there is a significant problem in the distributed canopy clusterer. I've been comparing the in-memory version with the distributed version (clustering users in the GroupLens data set), and they behave completely differently.

Firstly, different T1/T2 parameters are required to get the same number of clusters, even when the data and the distance measure are exactly the same. Secondly, even after I have tuned the parameters to get the same number of clusters, the distribution of cluster sizes is very different: with Tanimoto distance, for example, if there are N clusters, the distributed version likes to create N-1 singleton clusters and put all the remaining vectors into the one remaining cluster.

I have traced this to the fact that, under a single distance measure, distances between sparse vectors tend to fall in a different range than distances between dense vectors. The distributed implementation first clusters the (sparse) original vectors in each mapper, and then takes the (dense) centroid vectors output by the mappers and applies the same canopy clustering to them, using the same T1/T2 parameters.
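To make the range difference concrete, here is a toy, self-contained computation -- my own illustration, not Mahout code (Mahout has its own TanimotoDistanceMeasure) -- comparing the Tanimoto distance between two sparse rating-style vectors and between two dense centroid-style vectors:

// Toy illustration only (not Mahout code): Tanimoto distances between
// sparse rating-style vectors vs. dense centroid-style vectors.
public class TanimotoRangeDemo {

  // Tanimoto distance: 1 - dot(a,b) / (|a|^2 + |b|^2 - dot(a,b))
  static double tanimotoDistance(double[] a, double[] b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (int i = 0; i < a.length; i++) {
      dot   += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return 1.0 - dot / (normA + normB - dot);
  }

  public static void main(String[] args) {
    int dim = 50;

    // Two sparse "user" vectors: three ratings each, one item in common.
    double[] u1 = new double[dim], u2 = new double[dim];
    u1[0] = 1; u1[7] = 1; u1[19] = 1;
    u2[0] = 1; u2[23] = 1; u2[41] = 1;

    // Two dense "centroid" vectors: nonzero everywhere, similar magnitudes.
    double[] c1 = new double[dim], c2 = new double[dim];
    for (int i = 0; i < dim; i++) { c1[i] = 0.4 + 0.01 * i; c2[i] = 0.5; }

    // Prints ~0.8 for the sparse pair but ~0.12 for the dense pair.
    System.out.println("sparse-sparse: " + tanimotoDistance(u1, u2));
    System.out.println("dense-dense:   " + tanimotoDistance(c1, c2));
  }
}

Even though both pairs overlap heavily in direction, the sparse pair lands near the top of the distance range while the dense pair lands near the bottom, so no single T1/T2 can be appropriate for both stages.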
I confirmed this by using a single mapper and turning off the clustering in the reducer step (i.e., having the reducer output the same centroids that are input to it); in this case the clustering is fine -- somewhat obviously, perhaps, because this makes the distributed algorithm behave exactly like the in-memory version. Specifically, with Tanimoto distance and the reducer effectively turned off, the average distance between the original vectors is 0.984, and with T1 = T2 = 0.983 on 10% of the GroupLens data, I get 24 clusters.

Then, if I turn the reducer back on, I get only one cluster, because the average distance between the dense centroids output by the mapper drops to 0.235, so every centroid is now within T1 of every other centroid. To get a similar number of clusters from the unmodified distributed version, I have to decrease T1/T2 to 0.939, which gives 23 clusters, but they are much less evenly distributed: the largest cluster now contains 6779 vectors -- 97% of the input -- as opposed to 2684 in the in-memory/turned-off-reducer version. This appears to be because the mapper now generates many more intermediate clusters (257), the price of making T1/T2 appropriate for the different distance range of the reducer stage.

Is this a known shortcoming of distributed canopy, or am I missing something? It seems to me that for this to work, different T1/T2 parameters would be needed for the mapper and reducer steps (rough sketch below). That would be easy to program, but it would make tuning the parameters a lot harder -- unless there were some clever way to automatically adjust the parameters based on how sparse the vectors being clustered are.

Thanks.
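For concreteness, here is the rough sketch mentioned above -- my own toy code, not Mahout's actual mapper/reducer implementation -- of a single canopy pass whose T1/T2 are supplied per stage. A driver would run it once per mapper on the raw sparse vectors, and once in the reducer on the resulting dense centroids:

// Rough sketch only, not Mahout's implementation.
import java.util.ArrayList;
import java.util.List;

public class TwoThresholdCanopySketch {

  // Minimal stand-in for a distance measure (e.g. Tanimoto).
  interface DistanceMeasure {
    double distance(double[] a, double[] b);
  }

  // Standard canopy pass: a point contributes to every canopy whose center
  // is within t1 of it; if it is not within t2 of any existing center, it
  // also starts a new canopy. Returns each canopy's centroid (member mean).
  static List<double[]> canopyPass(List<double[]> points, double t1, double t2,
                                   DistanceMeasure measure) {
    List<double[]> centers = new ArrayList<double[]>();
    List<double[]> sums = new ArrayList<double[]>();
    List<Integer> counts = new ArrayList<Integer>();
    for (double[] p : points) {
      boolean stronglyBound = false;
      for (int i = 0; i < centers.size(); i++) {
        double d = measure.distance(centers.get(i), p);
        if (d < t1) {                      // loosely bound: add to this canopy
          double[] s = sums.get(i);
          for (int j = 0; j < s.length; j++) s[j] += p[j];
          counts.set(i, counts.get(i) + 1);
        }
        if (d < t2) stronglyBound = true;  // tightly bound: suppress new canopy
      }
      if (!stronglyBound) {                // start a new canopy centered on p
        centers.add(p);
        sums.add(p.clone());
        counts.add(1);
      }
    }
    List<double[]> centroids = new ArrayList<double[]>();
    for (int i = 0; i < sums.size(); i++) {
      double[] c = sums.get(i).clone();
      for (int j = 0; j < c.length; j++) c[j] /= counts.get(i);
      centroids.add(c);
    }
    return centroids;
  }

  // The proposed change: the mapper and reducer stages each get their own
  // T1/T2 pair instead of sharing one.
  static List<double[]> distributedCanopy(List<List<double[]>> mapperSplits,
                                          double mapperT1, double mapperT2,
                                          double reducerT1, double reducerT2,
                                          DistanceMeasure measure) {
    List<double[]> intermediate = new ArrayList<double[]>();
    for (List<double[]> split : mapperSplits) {
      intermediate.addAll(canopyPass(split, mapperT1, mapperT2, measure));
    }
    return canopyPass(intermediate, reducerT1, reducerT2, measure);
  }
}

The reducer's thresholds would then be tuned against the intermediate-centroid distances (averaging 0.235 in my data above) rather than the raw-vector distances (0.984).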