mahout-user mailing list archives

From sharath jagannath <>
Subject Re: Another set of basic questions
Date Fri, 04 Feb 2011 21:16:00 GMT
I would really appreciate it if somebody could respond.
I am trying to do online clustering of feed data.

I am now able to write my custom analyzer, create TF vectors, use canopy
as the seed generator, and cluster using KMeansDriver.
Question 1: I want to save the generated centroids. Is there a specific
interface with which I can create backups, or do I have to read them and save
them somewhere else, say a database, for further use?
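
To make Question 1 concrete, here is a rough sketch of the "read it and save it somewhere else" option in plain Python. It does not use any Mahout API: it assumes the centroids have already been read out of the k-means output into ordinary lists of weights, and the cluster ids, file name, and helper names are all hypothetical.

```python
import json
import os
import tempfile


def save_centroids(centroids, path):
    """Write a {cluster_id: [weights...]} mapping to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(centroids, f)


def load_centroids(path):
    """Read the mapping back from the JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hypothetical centroids, assumed to be extracted from the clustering output.
centroids = {"cluster-0": [0.5, 0.0, 1.5], "cluster-1": [0.0, 2.0, 0.25]}

# Hypothetical backup location; a database table would work the same way.
path = os.path.join(tempfile.gettempdir(), "centroids.json")
save_centroids(centroids, path)
restored = load_centroids(path)
```

A dense list per centroid is only practical for small vocabularies; for large sparse TF vectors a mapping of term index to weight would probably be a better stored form.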

Say I now have 100 articles and have grouped them into 10 clusters,
against which I want to cluster the new feed. Let's say I have 10 more articles.

My first approach:
I can rerun the same cycle to recluster everything, but that takes time, so I do
not want to do it for my online clustering.

Second approach:
I want to use the saved centroids generated in the initial phase and cluster
using CanopyDriver. But CanopyDriver takes vectors as input and generate
Question 2: Can we do it with CanopyDriver? I want to use the previous

If this is possible, let's say that out of my 10 new articles, 8 are grouped into one of
the existing clusters but 2 are new. To achieve this I need previous
I want to cluster the 2 new ones with the usual k-means and form new clusters.
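
The assignment step described above can be sketched in plain Python as follows. This is not Mahout code and not CanopyDriver behaviour: it assumes Euclidean distance and a hypothetical cutoff `threshold`, beyond which an article is treated as belonging to no existing cluster. The cluster ids and sample vectors are made up for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_or_flag(vectors, centroids, threshold):
    """Assign each vector to its nearest saved centroid, or flag it as new.

    Returns (assigned, unassigned): assigned maps a cluster id to the indices
    of the vectors placed in it; unassigned lists the indices of vectors whose
    nearest centroid is farther away than `threshold`.
    """
    assigned = {}
    unassigned = []
    for i, v in enumerate(vectors):
        best_id, best_dist = None, float("inf")
        for cid, c in centroids.items():
            d = euclidean(v, c)
            if d < best_dist:
                best_id, best_dist = cid, d
        if best_dist <= threshold:
            assigned.setdefault(best_id, []).append(i)
        else:
            unassigned.append(i)
    return assigned, unassigned


# Hypothetical saved centroids and new article vectors.
centroids = {"c0": [0.0, 0.0], "c1": [10.0, 10.0]}
new_vectors = [[0.5, 0.5], [9.0, 10.0], [50.0, 50.0]]

assigned, unassigned = assign_or_flag(new_vectors, centroids, threshold=2.0)
```

The vectors left in `unassigned` are the ones that would go through a regular clustering run to form new clusters. Choosing the threshold plays the same role as picking canopy distances and would need tuning on real data.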
Question 3: How should I add the centroids of the newly formed clusters to the
initial centroid list?
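
One possible reading of Question 3, sketched in plain Python rather than with Mahout classes: compute the mean vector of each newly formed cluster and append it to a copy of the saved centroid mapping under a fresh id. The `c<N>` id scheme, the helper names, and the sample data are assumptions made for illustration.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def add_new_clusters(centroids, new_clusters):
    """Return a copy of `centroids` extended with one centroid per new cluster.

    `new_clusters` is a list of clusters, each a list of member vectors.
    The original mapping is left unchanged.
    """
    merged = dict(centroids)
    next_id = len(merged)
    for members in new_clusters:
        # Skip over ids that are already taken (hypothetical "c<N>" naming).
        while f"c{next_id}" in merged:
            next_id += 1
        merged[f"c{next_id}"] = mean_vector(members)
    return merged


# Hypothetical saved centroids and one newly formed cluster of two articles.
existing = {"c0": [0.0, 0.0], "c1": [10.0, 10.0]}
new_clusters = [[[50.0, 50.0], [52.0, 48.0]]]

merged = add_new_clusters(existing, new_clusters)
```

The merged mapping can then serve as the centroid list for the next batch of incoming articles. Note that this sketch does not update the old centroids for the articles that were assigned to them.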

Again, I would appreciate a response. I know my questions are a bit stupid,
but for a novice I guess that is expected.


On Fri, Feb 4, 2011 at 9:38 AM, sharath jagannath <> wrote:

> anybody please?
> Thanks,
> Sharath
> On Thu, Feb 3, 2011 at 10:39 PM, sharath jagannath <> wrote:
>> I have 3 questions:
>> 1. Now that I am able to create clusters, I want to know how to find the
>> intra-cluster distance between data points, say the top m data points closest
>> to me within my cluster.
>> 2. Say I have created the initial clusters and now want to update them without
>> starting from scratch. I will use canopy to approximate the closest
>> cluster, but how do I know which new clusters were created from the data
>> points that are not part of any of the old clusters?
>> 3. After some time I want to recluster everything. How should I do it?
>> Where should I get all the vectors? Do I have to recreate
>> everything?
>> Thanks,
>> Sharath
