Message-ID: <4E894E8D.5000000@xebia.com>
Date: Mon, 3 Oct 2011 11:26:29 +0530
From: Paritosh Ranjan
Reply-To: user@mahout.apache.org
Subject: Re: Difference in results : Clustering : sequential and MapReduce
References: <4E876286.5040205@xebia.com> <4E88228C.6040902@xebia.com> <4E89331F.2080705@xebia.com>
In-Reply-To:
<4E89331F.2080705@xebia.com>

I found the reason for the difference. It comes from the check

    if (canopy.getNumPoints() > clusterFilter)

in CanopyMapper. Similar data is not distributed evenly across the mappers, so a mapper can end up with canopies that hold too few points to pass this check, and those canopies are not processed further.

That said, this check is a great performance enhancer; I have experienced that myself. Distributing similar vectors to the same mapper might help to attain both quality and performance.

On 03-10-2011 09:29, Paritosh Ranjan wrote:
> The sequential algorithm finds more/better clusters than the mapreduce one.
> There's not a huge difference, but the standalone one is better for sure.
>
> Thanks and Regards,
> Paritosh
>
> On 03-10-2011 01:47, Konstantin Shmakov wrote:
>> I'd assume that distributed and sequential algorithms shouldn't produce
>> identical results. To start with, they differ in initial setup:
>> -- In the distributed algorithm, each mapper deals with a subset of the data
>>    and starts by picking up a random point, so N random points are picked up
>>    by N mappers to start with.
>> -- In the sequential algorithm, 1 mapper deals with all the data and starts
>>    by picking up 1 random point.
>> But for data with real clusters, both algorithms should produce similar
>> results. How different are the results in your case?
>>
>> Thanks
>> --Konstantin
>>
>> On Sun, Oct 2, 2011 at 1:36 AM, Paritosh Ranjan wrote:
>>
>>> Even run() of CanopyDriver, which takes only T1 and T2, is giving different
>>> results for sequential and mapreduce.
>>> This is preventing me from scaling up, as I need to run mapreduce on hadoop
>>> to scale.
>>>
>>> Does anyone have any idea about this problem?
>>>
>>> On 02-10-2011 00:27, Paritosh Ranjan wrote:
>>>
>>>> Hi,
>>>>
>>>> I am able to cluster correctly sequentially, using CanopyDriver.
>>>>
>>>> However, the same dataset, when processed as a MapReduce job (where
>>>> t1 = t3, t2 = t4 and t1 > t2), is not working. I am getting errors like
>>>> "Canopies are empty".
>>>>
>>>> I also tried to reduce the values of t3 and t4, but reducing them either
>>>> has no effect or gives meaningless results.
>>>>
>>>> Am I doing something wrong, or is there a bug somewhere?
>>>>
>>>> I feel that both sequential and MapReduce should give similar results,
>>>> but it is not happening.
>>>>
>>>> Thanks and Regards,
>>>> Paritosh
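
P.S. The clusterFilter effect described at the top of this message can be reproduced with a toy example. The sketch below is not Mahout's code: it is a simplified 1-D canopy pass that uses only T2 and a fixed first-point center, and the class name, method names, and data are all hypothetical. It only assumes that each mapper drops canopies unless their point count is strictly greater than clusterFilter, as the CanopyMapper check quoted above suggests.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; NOT Mahout's CanopyMapper or CanopyClusterer.
class CanopyFilterSketch {

    // Greedy 1-D canopy pass (simplified assumption): a point joins the first
    // canopy whose center lies within t2, otherwise it starts a new canopy.
    // Returns the number of points in each canopy.
    static List<Integer> canopySizes(double[] points, double t2) {
        List<Double> centers = new ArrayList<>();
        List<Integer> sizes = new ArrayList<>();
        for (double p : points) {
            boolean placed = false;
            for (int i = 0; i < centers.size(); i++) {
                if (Math.abs(p - centers.get(i)) < t2) {
                    sizes.set(i, sizes.get(i) + 1);
                    placed = true;
                    break;
                }
            }
            if (!placed) {
                centers.add(p);
                sizes.add(1);
            }
        }
        return sizes;
    }

    // Counts canopies that pass the filter, mirroring the quoted check
    // "canopy.getNumPoints() > clusterFilter".
    static int countSurviving(double[] points, double t2, int clusterFilter) {
        int kept = 0;
        for (int size : canopySizes(points, t2)) {
            if (size > clusterFilter) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        double t2 = 1.0;
        int clusterFilter = 2;

        // Sequential run: one pass over all six points.
        double[] all = {0.0, 0.1, 0.2, 10.0, 10.1, 10.2};
        System.out.println(countSurviving(all, t2, clusterFilter)); // prints 2

        // Two "mappers" with similar points scattered across the splits.
        double[] mixedA = {0.0, 0.1, 10.0};
        double[] mixedB = {0.2, 10.1, 10.2};
        System.out.println(countSurviving(mixedA, t2, clusterFilter)
                + countSurviving(mixedB, t2, clusterFilter)); // prints 0

        // Two "mappers" where similar points are kept together.
        double[] groupedA = {0.0, 0.1, 0.2};
        double[] groupedB = {10.0, 10.1, 10.2};
        System.out.println(countSurviving(groupedA, t2, clusterFilter)
                + countSurviving(groupedB, t2, clusterFilter)); // prints 2
    }
}
```

With the scattered splits, every per-mapper canopy has at most two points, so none passes the filter and nothing reaches the next stage. That is consistent with the "Canopies are empty" error from the first message. Keeping similar points on the same mapper restores both canopies in this toy case.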