Mailing-List: contact user-help@mahout.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@mahout.apache.org
Received-SPF: pass (athena.apache.org: domain of weidezhang2007@gmail.com
 designates 209.85.213.42 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAL4To688JhT_Bx8D4kNkXG_hWBMJnAThKejbdJiYR9Hmavxetw@mail.gmail.com>
References: 
 <CAMvZ5=7ZcT-u0pKksr7sCdrji8-h4uK+mw3NjFdbCMMt1aY+uw@mail.gmail.com>
	<CAL4To688JhT_Bx8D4kNkXG_hWBMJnAThKejbdJiYR9Hmavxetw@mail.gmail.com>
Date: Mon, 3 Oct 2011 10:38:30 -0700
Message-ID: 
 <CAMvZ5=75AozS1BNszCo5bA9ngkhsAaCO2MNg9V-n1uFCtDCPFg@mail.gmail.com>
Subject: Re: question about clustering
From: Walter Chang <weidezhang2007@gmail.com>
To: user@mahout.apache.org
Content-Type: multipart/alternative; boundary=0015174fecec7dc21f04ae6871da

--0015174fecec7dc21f04ae6871da
Content-Type: text/plain; charset=ISO-8859-1

Hi Kate,

I have 60 rows of data, each with a text description. I generated tf-idf
vectors using my own analyzer and passed them to the clustering algorithm.
With k=3, it generates clusters-1 and clusters-2 folders.
What does each folder mean? How does the clustering process generate
them?
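
To make the question concrete, here is a minimal Python sketch of a generic
k-means loop (not Mahout's implementation). It assumes, without confirmation
from this thread, that each clusters-N folder corresponds to the centers after
iteration N and that -cd acts as a threshold on how far the centers moved. The
function name and the toy points are hypothetical.

```python
import math

def kmeans(points, centers, convergence_delta, max_iterations):
    """Return one list of centers per iteration; index 0 is the initial set."""
    history = [centers]
    for _ in range(max_iterations):
        # Assign each point to its nearest current center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its points (keep it if empty).
        new_centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else old
            for g, old in zip(groups, centers)
        ]
        history.append(new_centers)
        # Assumed stopping rule: no center moved more than the delta.
        moved = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved <= convergence_delta:
            break
    return history

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
start = [(0, 0), (10, 10)]
loose = kmeans(pts, start, 1.0, 1000)   # large delta: stops early
tight = kmeans(pts, start, 0.0, 1000)   # zero delta: one more pass
```

Under these assumptions, a large delta such as 1.0 lets the loop stop after
very few passes even when the iteration limit is 1000, which would match
seeing only a couple of output folders.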

Weide

On Mon, Oct 3, 2011 at 8:04 AM, Kate Ericson <ericson@cs.colostate.edu> wrote:

> Hi Weide,
>
> As a disclaimer, I only know enough to try to help you figure out your
> first problem.
> First of all, can you tell us about the dataset you are using?
> How many points are you clustering?
>
> As a guess without knowing either of these things, part of the reason
> why your clusters look the same is that you're only clustering around
> 3 points.  You're only running for 2 iterations, so it looks like it's
> just not moving your cluster centers around at all.  Can you try again
> with a larger k?
> This may let it run for more iterations so you should be able to see
> more changes in results.
>
> Good luck!
>
> -Kate
>
> On Sun, Oct 2, 2011 at 9:52 PM, Walter Chang <weidezhang2007@gmail.com>
> wrote:
> > Hi,
> >
> > I have used Mahout to produce k-means clustering for my tf-idf result. I
> > used the mahout command line to produce the clusters, and it seems to
> > complete successfully.
> >
> > $MAHOUT_HOME/bin/mahout kmeans -i ./tfidf-vectors -c ./initialclusters -o ./kmeans-clusters -cd 1.0 -k 3 -x 1000
> >
> > Two cluster directories were generated (clusters-1 and clusters-2). When I
> > run clusterdump on each of them, the top terms for the clusters look the
> > same. Any idea why?
> >
> > Also, how can I see which documents have been assigned to each cluster?
> > Right now I can see the number of documents assigned, but not the complete
> > list.
> >
> > Most importantly, for production purposes, I assume it makes sense for
> > k-means to always run on Hadoop to generate the clustering file. But how do
> > I consume the results during serving? Ideally, the server would take a doc
> > id or a query as input and return the top documents, ranked by score, from
> > the same cluster. How do I do this in code? Any good examples?
> >
> > Thanks a lot,
> >
> > Weide
> >
>

--0015174fecec7dc21f04ae6871da--