From: Danfeng Li
To: user@mahout.apache.org
Date: Tue, 8 May 2012 17:51:17 -0400
Subject: RE: kmeans not returning k clusters

I got the same issue. What I found is that many of the initial centers are empty, and the final number of clusters is decided by the number of nonempty centers. Here are some examples from my case:

...
CL-34358205{n=0 c=[] r=[]}
CL-34358207{n=0 c=[] r=[]}
CL-34358209{n=0 c=[] r=[]}
CL-34358213{n=0 c=[0:1.000] r=[]}
CL-34358215{n=0 c=[] r=[]}
CL-34358216{n=0 c=[] r=[]}
CL-34358217{n=0 c=[] r=[]}
CL-34358220{n=0 c=[] r=[]}
CL-34358221{n=0 c=[] r=[]}
CL-34358222{n=0 c=[] r=[]}
CL-34358223{n=0 c=[] r=[]}
CL-34358224{n=0 c=[] r=[]}
CL-34358227{n=0 c=[0:1.000] r=[]}
CL-34358228{n=0 c=[] r=[]}
CL-34358229{n=0 c=[] r=[]}
...

Is it the case that there is a bug in initialization?

Thanks.
Dan

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Tuesday, May 08, 2012 9:13 AM
To: user@mahout.apache.org
Subject: Re: kmeans not returning k clusters

Here is a sample data set. In this case I asked for 30 and got 28, but in other cases the discrepancy has been greater, like asking for 200 and getting 38, but that was for a much larger data set.
Running on my Mac laptop in a single-node pseudo-cluster: Hadoop 0.20.205, Mahout 0.6.

Command line:

mahout kmeans \
    -i b2/bixo-vectors/tfidf-vectors/ \
    -c b2/bixo-kmeans-centroids \
    -cl \
    -o b2/bixo-kmeans-clusters \
    -k 30 \
    -ow \
    -cd 0.01 \
    -x 20 \
    -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure

Find the data here:
http://cloud.occamsmachete.com/apps/files_sharing/get.php?token=0b2dacddca05c0ee48cbebd05048434425b86740

BTW, when I run rowsimilarity asking for 20 similar docs I get a max of 20 but sometimes many fewer. Shouldn't this always return the requested number? I'll post this question again to the attention of the right person.

On 5/8/12 6:15 AM, Paritosh Ranjan wrote:
> I looked at the 0.6 version's code but was not able to find any reason.
> If possible, can you share the data you are trying to cluster along
> with the execution parameters?
>
> You can also open a Jira for this and provide the info there.
>
> On 07-05-2012 19:45, Pat Ferrel wrote:
>> 0.6
>>
>> I take it this is not expected behavior? I could be doing something
>> stupid. I only look in the "final" directory. Looking in the others
>> with clusterdump shows the same number of clusters and I assumed they
>> were iterations.
>>
>> On 5/7/12 1:21 AM, Paritosh Ranjan wrote:
>>> Which version are you using? 0.6 or the current 0.7-snapshot?
>>>
>>> On 07-05-2012 02:19, Pat Ferrel wrote:
>>>> What would cause kmeans to not return k clusters? As I tweak
>>>> parameters I get different numbers of clusters but it's usually
>>>> less than the k I pass in. Since I am not using canopies at present
>>>> I would expect k to always be honored but the quality of the
>>>> clusters would depend on the convergence amount and number of
>>>> iterations allowed. No?
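
One way to check Dan's observation about empty initial centers is to count them directly in the dumped text. The following is only a rough sketch: it assumes each cluster appears in the same `CL-<id>{n=... c=[...] r=[...]}` form as the sample lines quoted in this thread (real clusterdump output may be laid out differently), and the function name `summarize_centers` is made up for illustration, not part of Mahout.

```python
import re

# Assumption: one cluster per entry, formatted like the lines quoted in this
# thread, e.g. "CL-34358213{n=0 c=[0:1.000] r=[]}".
CLUSTER_RE = re.compile(r"(CL-\d+)\{n=(\d+) c=\[(.*?)\] r=\[(.*?)\]\}")


def summarize_centers(dump_text):
    """Return (total, empty_ids): cluster count and ids whose center c=[] is empty."""
    total = 0
    empty_ids = []
    for cluster_id, _n, center, _radius in CLUSTER_RE.findall(dump_text):
        total += 1
        if not center.strip():
            empty_ids.append(cluster_id)
    return total, empty_ids


if __name__ == "__main__":
    # A few of the lines from Dan's message, used as sample input.
    sample = """
    CL-34358205{n=0 c=[] r=[]}
    CL-34358213{n=0 c=[0:1.000] r=[]}
    CL-34358215{n=0 c=[] r=[]}
    CL-34358227{n=0 c=[0:1.000] r=[]}
    """
    total, empty_ids = summarize_centers(sample)
    print(f"{total} initial clusters, {len(empty_ids)} with empty centers")
```

If the count of nonempty centers matches the number of clusters in the final output, that would be consistent with the initialization step being the place to look, though it would not by itself prove a bug.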