mahout-user mailing list archives

From Walter Chang <weidezhang2...@gmail.com>
Subject Re: question about clustering
Date Mon, 03 Oct 2011 17:38:30 GMT
Hi Kate,

I have 60 rows of data, each with a text description. I generated tf-idf
vectors using my own analyzer and passed them to the clustering algorithm.
With k=3, it generates clusters-1 and clusters-2 folders. What does each
folder mean? How does the clustering process generate them?

Weide

On Mon, Oct 3, 2011 at 8:04 AM, Kate Ericson <ericson@cs.colostate.edu> wrote:

> Hi Weide,
>
> As a disclaimer, I only know enough to try to help you figure out your
> first problem.
> First of all, can you tell us about the dataset you are using?
> How many points are you clustering?
>
> As a guess without knowing either of these things, part of the reason
> why your clusters look the same is that you're only clustering around
> 3 points.  You're only running for 2 iterations, so it looks like it's
> just not moving your cluster centers around at all.  Can you try again
> with a larger k?
> This may let it run for more iterations, so you should be able to see
> more changes in the results.
>
> Good luck!
>
> -Kate
>
> On Sun, Oct 2, 2011 at 9:52 PM, Walter Chang <weidezhang2007@gmail.com> wrote:
> > Hi,
> >
> > I have used Mahout to run k-means clustering on my tf-idf output. I ran it
> > from the mahout command line, and it seems to complete successfully.
> >
> > $MAHOUT_HOME/bin/mahout kmeans -i ./tfidf-vectors -c ./initialclusters -o ./kmeans-clusters -cd 1.0 -k 3 -x 1000
> >
> > Two cluster directories were generated (clusters-1 and clusters-2). When I
> > run clusterdump on each of them, the top terms per cluster look the same.
> > Any idea why?
> >
> > Also, how can I see which documents have been assigned to each cluster?
> > Right now I can see the number of documents assigned, but not the complete
> > list.
> >
> > Most importantly, for production purposes, I assume it makes sense to always
> > run k-means on Hadoop to generate the clustering files. But how do I consume
> > these at serving time? Ideally, the server would take a doc id or a query
> > and return the top document, ranked by score, within the same cluster. How
> > do I do this in code? Any good examples?
> >
> > Thanks a lot,
> >
> > Weide
> >
>
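
The symptom discussed in this thread (two cluster directories whose contents look the same) can be reproduced with a small plain-Python k-means. This is only a sketch under assumptions, not Mahout's implementation: the function name kmeans_snapshots, the toy 2-D points, and the starting centroids are all made up for illustration, and the sketch uses Euclidean distance, which may differ from the measure used in the run above. The convergence_delta and max_iterations parameters are meant to play the role of the -cd and -x options in the command. Each entry in the returned list is the set of centroids after one iteration, loosely analogous to one clusters-N folder.

```python
import math

def kmeans_snapshots(points, centroids, convergence_delta, max_iterations):
    """Run Lloyd's k-means and return the centroid set after each iteration."""
    snapshots = []
    for _ in range(max_iterations):
        # Assignment step: group each point with its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group
        # (an empty group keeps its previous centroid).
        new_centroids = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
        snapshots.append(new_centroids)
        # Stop once no centroid moved by more than convergence_delta.
        moved = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if moved <= convergence_delta:
            break
    return snapshots

# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
initial = [(0.0, 0.0), (10.0, 10.0)]

loose = kmeans_snapshots(points, initial, convergence_delta=1.0, max_iterations=1000)
tight = kmeans_snapshots(points, initial, convergence_delta=0.001, max_iterations=1000)
print(len(loose))            # 1
print(len(tight))            # 2
print(tight[0] == tight[1])  # True
```

In this toy run the centroids settle after the first pass, so the second snapshot is identical to the first, and a large convergence delta stops the loop even earlier. If the real run behaves similarly, a smaller -cd value or a different set of initial clusters may be worth trying before concluding that something is wrong.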
