mahout-user mailing list archives

From Andrew Musselman <andrew.mussel...@gmail.com>
Subject Re: Kmeans clusterdump Interpretation
Date Tue, 21 Jul 2015 01:11:08 GMT
I'm not sure a centroid id is even defined, since the centroid, as I
understand it, is just a point in space (the mean of the cluster's
members), not necessarily a point in your data.

Are you trying to find the most-central point in a given cluster?
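
If so, here is a minimal sketch of the idea in plain Python (not Mahout API). It assumes the vectors are available as sparse {term: weight} dicts keyed by the idField value, and it assumes Euclidean distance; swap in whatever measure the k-means run actually used. The article titles and weights are made up for illustration. The same information is already in the clusterdump output: the point listed with the smallest distance value is the one closest to the centroid.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two sparse vectors stored as {term: weight} dicts."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))

def centroid(points):
    """Component-wise mean of all vectors in the cluster."""
    total = {}
    for vec in points.values():
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    n = len(points)
    return {term: weight / n for term, weight in total.items()}

def most_central(points, center):
    """Return (id, distance) for the data point closest to the given center."""
    return min(
        ((pid, euclidean(vec, center)) for pid, vec in points.items()),
        key=lambda pair: pair[1],
    )

# Hypothetical cluster members, keyed by document id (here: the title).
points = {
    "Article A": {"account": 0.2, "bank": 0.4},
    "Article B": {"account": 0.4, "bank": 0.2},
    "Article C": {"account": 0.3, "bank": 0.6},
}

center = centroid(points)            # a point in space, not one of the articles
best_id, best_dist = most_central(points, center)
print(best_id, round(best_dist, 4))
```

In this toy data the centroid matches none of the three articles, which is why it has no id of its own; the closest article is the nearest thing to a "center document".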

On Mon, Jul 20, 2015 at 5:18 PM, Ankit Goel <ankitgoel2004@gmail.com> wrote:

> Hi,
> I've been experimenting with Mahout 0.10 and k-means clustering on a Solr
> 4.6.1 index. The data is news articles. The --field option for kmeans is set
> to "content". The idField is set to "title" (just so I can analyse it faster).
> The clusterdump of the kmeans result gives me proper output, but I can't
> figure out the id of the vector chosen as the center. There are only 14-15
> articles, so I am not hung up on cluster performance at this time.
>
> I used random seeds on the kmeans command line.
> For reference, this is the clusterdump command line I am executing:
>
> bin/mahout clusterdump -i $MAHOUT_HOME/testCluster/clusters-3-final
> -p $MAHOUT_HOME/testCluster/clusteredPoints -d $MAHOUT_HOME/dict.txt -b 5
>
> The output I get is of the form
>
> :{"r":
>
> top terms
>
> xxxxx==>xxxxx
>
> Weight : [props - optional]:  Point:
>
>  1.0 : [distance=0.0]: [{"account":0.026}.......other features]
>
> 1.0 : [distance=0.3963903651622338]: [....]
>
>
> So how exactly do I get the centroid id? I have even tried accessing it
> from Java with
>
> ClusterWritable value.getValue().getCenter()
>
> but this just gives me the features and values of the centroid.
>
> Also, please explain the meaning of "account":0.026 (just to make sure I
> understand it correctly). I used tf-idf.
>
> --
> Regards,
> Ankit Goel
> http://about.me/ankitgoel
>
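
On the "account":0.026 question in the quoted message: as far as I understand the dump format, each entry pairs a dictionary term with its weight in that vector, so with tf-idf vectors 0.026 would be the tf-idf weight of the term "account" for that point. The sketch below uses the textbook formula (term frequency times log of inverse document frequency) in plain Python; it is an illustration with made-up documents, not necessarily the exact weighting or normalization Mahout applies.

```python
import math
from collections import Counter

def tfidf(docs):
    """Textbook tf-idf: (count / doc length) * log(N / document frequency)."""
    n = len(docs)
    # Number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {term: (count / len(doc)) * math.log(n / df[term])
             for term, count in tf.items()}
        )
    return vectors

# Hypothetical tokenized documents.
docs = [
    ["bank", "account", "bank"],
    ["bank", "loan"],
    ["weather", "rain"],
]

vecs = tfidf(docs)
for term, weight in sorted(vecs[0].items()):
    print(term, round(weight, 4))
```

A term that is frequent in one document but rare across the collection gets a high weight; a term absent from a document simply has no entry in that document's sparse vector.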
