In-Reply-To: <4ECB7E8F.8030700@xebia.com>
Date: Tue, 22 Nov 2011 08:43:04 -0300
Subject: Re: Clustering Question (from a newbie)
From: "Fernando O."
To: user@mahout.apache.org

In clustersIn I had #Categories clusters, each with an arbitrary vector
as its initial centroid (I was using the first #Categories vectors that
I read).
I realized that, since my values are percentages, I could build the
initial centroids by hand instead: one per category, with 0.5 in the
corresponding category and 0 in all the others.

It turns out that this works really well :D
I still have to take a closer look, but the result looks correct.

Now I'm wondering whether there is any paper that supports my assumption.

On Tue, Nov 22, 2011 at 7:50 AM, Paritosh Ranjan wrote:

> public static void run(Configuration conf,
>                        Path input,
>                        Path clustersIn,
>                        Path output,...
>
> The third parameter (the second Path) is clustersIn. What are you
> providing there?
>
> I suggest that you first use canopy clustering to find an appropriate
> number of clusters, and then pass the result as clustersIn.
> You might be passing the wrong clustersIn, which can cause problems.
>
> Paritosh
>
>
> On 22-11-2011 16:12, Fernando O. wrote:
>
>> Hi all,
>> Disclaimer: I'm a total newbie in data mining / clustering / AI and
>> all the areas around them. My knowledge of clustering is basically
>> what I learned in my regular CS courses; I have never done research
>> or work with it before.
>>
>> Any reading recommendation would be much appreciated :D
>>
>> I'm trying to understand a large set of data: I have a set of
>> geographical regions, and for each region I have N characteristics
>> or categories. The measure that I have is something like an
>> indicator of the importance of that characteristic in that region.
>>
>> So I have a table something like this:
>>
>>        C1    C2    C3
>> R1    80%   20%    0%
>> R2    75%   25%    0%
>> R3    50%   20%   30%
>>
>> From what I read, K-means works pretty well for most cases, so I
>> chose that clustering technique.
>> Then I used the Tanimoto distance because I wanted to measure the
>> correlation between categories.
>>
>> Right now I have a small set: 148 regions and 13 categories. Of those
>> 148 regions, only one has more than 1% in Cn, and it has in fact 36%.
>>
>> So I would expect that if I set the number of clusters to something
>> relatively large (15 or 20), I would get a cluster containing only
>> that region with Cn=36%.
>>
>> My problem is that I couldn't make it happen, and I'm not sure why.
>> In fact, I get some empty clusters.
>>
>> R1    58,30%   1,10%   0,00%   5,66%   5,55%   2,24%   1,42%   3,20%   1,12%  14,75%   6,23%   0,25%   0,01%   0,16%
>> R2    37,08%   1,95%   0,00%  26,27%   4,86%   0,11%   0,00%   0,00%   0,76%   7,78%  18,16%   0,00%   0,00%   0,00%
>> R3    48,86%   3,03%   6,14%   5,98%   7,91%   1,85%   1,69%   3,55%   0,43%  15,63%   4,83%   0,09%   0,00%   0,00%
>> *R4*   8,86%   0,59%   6,60%   2,46%   2,06%   1,26%   0,26%   1,71%   0,47%   6,11%   7,43%   0,03%  61,96%   0,21%
>> R5    51,56%   2,55%   0,00%  16,08%   7,29%   0,49%   3,31%   1,22%   0,47%  13,49%   3,53%   0,01%   0,00%   0,00%
>> R6    40,15%   6,26%   0,00%   8,07%   5,25%   0,20%   0,45%  13,29%   1,28%  12,85%  11,64%   0,00%   0,00%   0,55%
>>
>> I'm running K-means like this:
>>
>> KMeansDriver.run(conf, new Path("mahoutTest/regions"),
>>     new Path("testdata/clusters"), new Path("output"),
>>     new TanimotoDistanceMeasure(), 0.001, 1000, true, false);
>>
>> The vectors for each region are scaled by 1/100 (that 8,86% is 0.0886).
>>
>> Any idea what I might be doing wrong? (Please don't say everything! :D)
>>
>> Thanks a lot!
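
For anyone who wants to try the seeding idea from the reply above without a Hadoop setup, here is a minimal sketch in plain Java. It is not Mahout code: the class SeededCentroids and its methods are hypothetical names made up for illustration, and the distance is the usual continuous Tanimoto formula, 1 - a.b / (|a|^2 + |b|^2 - a.b), which is assumed (not verified here) to match what TanimotoDistanceMeasure computes. It only builds the seeds and does the first assignment step, not a full K-means run. The sample vector is row R4 from the quoted table, scaled by 1/100.

```java
// Plain-Java sketch: no Mahout classes are used, and every name here is hypothetical.
final class SeededCentroids {

    /** One seed per category: 0.5 in that category's slot, 0 everywhere else. */
    static double[][] seedCentroids(int numCategories) {
        double[][] seeds = new double[numCategories][numCategories];
        for (int i = 0; i < numCategories; i++) {
            seeds[i][i] = 0.5;
        }
        return seeds;
    }

    /** Assumed formula: 1 - a.b / (|a|^2 + |b|^2 - a.b). */
    static double tanimotoDistance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        double denominator = normA + normB - dot;
        if (denominator == 0.0) {
            return 0.0; // both vectors are all zeros, so treat them as identical
        }
        return 1.0 - dot / denominator;
    }

    /** Index of the centroid with the smallest Tanimoto distance to the point. */
    static int nearestCentroid(double[] point, double[][] centroids) {
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centroids.length; i++) {
            double distance = tanimotoDistance(point, centroids[i]);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Row R4 from the quoted table, as fractions (8,86% becomes 0.0886).
        double[] r4 = {0.0886, 0.0059, 0.0660, 0.0246, 0.0206, 0.0126, 0.0026,
                       0.0171, 0.0047, 0.0611, 0.0743, 0.0003, 0.6196, 0.0021};
        double[][] seeds = seedCentroids(r4.length);
        int nearest = nearestCentroid(r4, seeds);
        System.out.println("R4 -> seed " + nearest
                + ", distance " + tanimotoDistance(r4, seeds[nearest]));
    }
}
```

With seeds of this shape, the similarity between a region and the seed for category i grows with the region's value in category i, so in the first assignment step every region goes to the seed of its largest category. R4 therefore starts in the cluster of its 61,96% column (index 12, counting from 0) instead of sharing a cluster with the regions dominated by the first column.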