Subject: Re: question about clustering
From: Walter Chang
To: user@mahout.apache.org
Date: Mon, 3 Oct 2011 10:38:30 -0700

Hi Kate,

I have 60 rows of data, each with a text description. I generated
tf-idf vectors using my own analyzer and passed them to the clustering
algorithm. With k=3, it generates the folders clusters-1 and clusters-2.
What does each folder mean? How does the clustering process generate
them?

Weide

On Mon, Oct 3, 2011 at 8:04 AM, Kate Ericson wrote:

> Hi Weide,
>
> As a disclaimer, I only know enough to try to help you figure out your
> first problem.
> First of all, can you tell us about the dataset you are using?
> How many points are you clustering?
>
> As a guess without knowing either of these things, part of the reason
> why your clusters look the same is that you're only clustering around
> 3 points. You're only running for 2 iterations, so it looks like it's
> just not moving your cluster centers around at all. Can you try again
> with a larger k?
> This may let it run for more iterations, so you should be able to see
> more changes in the results.
>
> Good luck!
>
> -Kate
>
> On Sun, Oct 2, 2011 at 9:52 PM, Walter Chang wrote:
> > Hi,
> >
> > I have used Mahout to run k-means clustering on my tf-idf output. I
> > used the mahout command line to produce the clusters, and it seems
> > to complete successfully.
> >
> > $MAHOUT_HOME/bin/mahout kmeans -i ./tfidf-vectors -c ./initialclusters -o ./kmeans-clusters -cd 1.0 -k 3 -x 1000
> >
> > Two cluster directories are generated (clusters-1 and clusters-2).
> > When I run clusterdump on each of them, the top terms of the
> > clusters look the same. Any idea why?
> >
> > Also, how can I see which documents have been assigned to each
> > cluster? Right now I can see the number of documents assigned, but
> > not the complete list.
> >
> > Most importantly, for production purposes, I assume it makes sense
> > for k-means to always run on Hadoop to generate the clustering
> > file. But how do I consume the result during serving? Ideally, the
> > server would take a doc id or a query as input and return the top
> > documents within the same cluster, ranked by score. How do I do
> > that in code? Any good examples?
> >
> > Thanks a lot,
> >
> > Weide
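
To illustrate the iteration behaviour discussed in this thread, here is a
minimal sketch in plain Python (3.8 or later, for math.dist). It is not
Mahout code: the function name kmeans_snapshots and the toy points are
made up for illustration. It assumes that each clusters-N folder holds
the cluster centers after iteration N, and that the run stops once no
center moves farther than the convergence delta (the -cd option) or the
iteration limit (-x) is reached.

```python
import math


def kmeans_snapshots(points, centers, convergence_delta, max_iterations):
    """Run Lloyd's k-means and return the list of centers after each iteration."""
    snapshots = []
    for _ in range(max_iterations):
        # Assignment step: put each point in the group of its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its old center).
        new_centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else old
            for g, old in zip(groups, centers)
        ]
        snapshots.append(new_centers)  # plays the role of one clusters-N folder
        # Convergence test: stop when no center moved more than the delta.
        moved = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved <= convergence_delta:
            break
    return snapshots


# Hypothetical toy data: two well-separated groups, poorly chosen start centers.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
initial = [(0, 0), (1, 0)]

tight = kmeans_snapshots(points, initial, convergence_delta=0.001, max_iterations=1000)
loose = kmeans_snapshots(points, initial, convergence_delta=5.0, max_iterations=1000)
print(len(tight), len(loose))
```

On this toy data the loose delta stops after two snapshots and the tight
delta after three, and the last two snapshots of the tight run are
identical because the centers have stopped moving. If the -cd value is
large relative to the distances between the vectors, a run can end after
very few iterations in the same way, with consecutive snapshots that
differ little.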