mahout-user mailing list archives

From Ankit Goel <ankitgoel2...@gmail.com>
Subject Re: Kmeans clusterdump Interpretation
Date Tue, 21 Jul 2015 01:45:30 GMT
That kind of puts me in a tough position. I was planning to use kmeans as a
method for aggregating similar articles from multiple news sources, and
then picking a representative article from each group. By similar I mean
articles that come from different news sources but cover the exact same
story. Intuitively it seems that these articles would get grouped
together. Any suggestions on how I should go about that? So far I'm using
Nutch to crawl, Solr to index, and now I'm here on Mahout.
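
For reference, a common cheap stand-in for a true medoid is the real article
whose vector lies closest to the computed centroid. Below is a minimal,
self-contained Java sketch of that idea, with a brute-force medoid for
comparison. It does not use any Mahout API: the class name, method names,
article ids and three-term vectors are all hypothetical, and cosine distance
is an assumption (any other distance measure could be swapped in).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch only: choose a representative document for one cluster. */
class RepresentativeSketch {

    /** Cosine distance between two dense vectors of equal length. */
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 1.0; // assumption: an all-zero vector is maximally distant
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Component-wise mean of all points, i.e. the k-means style centroid. */
    static double[] mean(Map<String, double[]> points) {
        int dim = points.values().iterator().next().length;
        double[] centroid = new double[dim];
        for (double[] p : points.values()) {
            for (int i = 0; i < dim; i++) {
                centroid[i] += p[i];
            }
        }
        for (int i = 0; i < dim; i++) {
            centroid[i] /= points.size();
        }
        return centroid;
    }

    /** Id of the actual data point nearest to the given centroid. */
    static String nearestToCentroid(Map<String, double[]> points, double[] centroid) {
        String bestId = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, double[]> e : points.entrySet()) {
            double d = cosineDistance(e.getValue(), centroid);
            if (d < bestDistance) {
                bestDistance = d;
                bestId = e.getKey();
            }
        }
        return bestId;
    }

    /** Brute-force medoid: the point with the smallest total distance to all points. */
    static String medoid(Map<String, double[]> points) {
        String bestId = null;
        double bestTotal = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, double[]> candidate : points.entrySet()) {
            double total = 0.0;
            for (double[] other : points.values()) {
                total += cosineDistance(candidate.getValue(), other);
            }
            if (total < bestTotal) {
                bestTotal = total;
                bestId = candidate.getKey();
            }
        }
        return bestId;
    }

    public static void main(String[] args) {
        // Hypothetical tf-idf vectors for three articles in one cluster.
        Map<String, double[]> cluster = new LinkedHashMap<>();
        cluster.put("A", new double[] {1.0, 0.0, 0.0});
        cluster.put("B", new double[] {0.8, 0.6, 0.0});
        cluster.put("C", new double[] {0.0, 1.0, 0.0});

        double[] centroid = mean(cluster);
        System.out.println("nearest to centroid: " + nearestToCentroid(cluster, centroid));
        System.out.println("medoid: " + medoid(cluster));
    }
}
```

The medoid search compares every pair of points, so its cost grows
quadratically with cluster size, while the nearest-to-centroid pass is
linear. With only 14-15 articles either one is cheap.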

On Tue, Jul 21, 2015 at 7:10 AM, Ted Dunning <ted.dunning@gmail.com> wrote:

> The most central point in a cluster is often referred to as a medoid
> (similar to median, but multi-dimensional).
>
> The Mahout code does not compute medoids.  In general, they are difficult
> to compute and implementing a full k-medoid clustering algorithm even more
> so.
>
>
>
> On Mon, Jul 20, 2015 at 6:25 PM, Ankit Goel <ankitgoel2004@gmail.com>
> wrote:
>
> > Oh, I thought kmeans gave me a point vector as a centroid, not a
> > calculated point central to a cluster. I guess in this case I would be
> > looking for the most central point vector (from the index) that I can
> > use as a representative of the cluster.
> >
> > On Tue, Jul 21, 2015 at 6:41 AM, Andrew Musselman <
> > andrew.musselman@gmail.com> wrote:
> >
> > > I'm not sure centroid id is even a defined thing, especially since the
> > > centroid, in my understanding, is just a point in space, not
> > > necessarily a point in your data.
> > >
> > > Are you trying to find the most-central point in a given cluster?
> > >
> > > On Mon, Jul 20, 2015 at 5:18 PM, Ankit Goel <ankitgoel2004@gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > > I've been messing with mahout 0.10 and kmeans clustering with a solr
> > > > 4.6.1 index. The data is news articles. The --field option for kmeans
> > > > is set to "content". The idField is set to "title" (just so I can
> > > > analyse it faster). The clusterdump of the kmeans result gives me a
> > > > proper output, but I can't figure out the id of the vector chosen as
> > > > the center. There are only 14-15 articles so I am not hung up about
> > > > the cluster performance at this time.
> > > >
> > > > I used random seeds for the kmeans commandline.
> > > > For reference, this is the commandline cluster dump I am executing
> > > >
> > > > bin/mahout clusterdump -i $MAHOUT_HOME/testCluster/clusters-3-final
> > > > -p $MAHOUT_HOME/testCluster/clusteredPoints -d $MAHOUT_HOME/dict.txt
> > > > -b 5
> > > >
> > > > The output I get is of the form
> > > >
> > > > :{"r":
> > > >
> > > > top terms
> > > >
> > > > xxxxx==>xxxxx
> > > >
> > > > Weight : [props - optional]:  Point:
> > > >
> > > >  1.0 : [distance=0.0]: [{"account":0.026}.......other features]
> > > >
> > > > 1.0 : [distance=0.3963903651622338]: [....]
> > > >
> > > >
> > > > So how exactly do I get the centroid id? I have even tried accessing
> > > > it with java
> > > >
> > > > ClusterWritable value.getValue().getCenter() but this just gives me
> > > > the features and values of the centroid.
> > > >
> > > > Also, please do explain the meaning of "account":0.026 (just making
> > > > sure I know it right). I used tfidf.
> > > >
> > > > --
> > > > Regards,
> > > > Ankit Goel
> > > > http://about.me/ankitgoel
> > > >
> > >
> >
> >
> >
> > --
> > Regards,
> > Ankit Goel
> > http://about.me/ankitgoel
> >
>



-- 
Regards,
Ankit Goel
http://about.me/ankitgoel
