mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Updating clusters
Date Fri, 16 Jul 2010 16:06:42 GMT

On Jul 16, 2010, at 11:27 AM, Asif Rahman wrote:

> Can anyone provide some advice on how to update an existing clustering with
> new data points.  Our data set is approximately 1mm newspaper headlines over
> the course of a month.  I'm able to get a high quality clustering using the
> existing mahout tasks (I'm just using canopy in this instance)

[OT] Care to share more (since you've already said you are using it)?  https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout

> but I'd like
> to update the clusters on an hourly basis.  Given the hardware that is
> available to me, I won't be able to run the clustering to completion over
> the entire data set every hour.  Are there any methods for completing such a
> task?

How many new docs are you talking about in that hour?  I'm sure others can add here, but AIUI, people
in this situation often compute the clusters once and then, for new docs arriving in some time period,
just find which existing cluster each new document is closest to and add it there.  Then, offline
or "later", they recluster the whole set.  So, for instance, nightly or every 6 hours
or whatever you can afford, you run the whole job, and in between you just do the lighter-weight
assignment step.  I imagine there are also ways of detecting when a new cluster is
needed or when quality has dropped too much, so that could be used to trigger a new
full run, too.
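
To make the in-between step concrete, here is a minimal sketch in plain Python.  It does not use Mahout's API.  The function names, the choice of Euclidean distance, and both thresholds (max_distance, max_outlier_fraction) are assumptions for illustration only.  It assigns each new point to the nearest existing centroid and counts the points that sit far from every centroid, as one possible signal that a full run is due.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_to_nearest(point, centroids, distance=euclidean):
    """Return (index, distance) of the centroid closest to point."""
    return min(
        ((i, distance(point, c)) for i, c in enumerate(centroids)),
        key=lambda pair: pair[1],
    )


def assign_batch(points, centroids, max_distance):
    """Assign every point to its nearest centroid.

    Also counts "outliers": points farther than max_distance
    (a hypothetical threshold) from their nearest centroid.
    """
    assignments = []
    outliers = 0
    for p in points:
        idx, dist = assign_to_nearest(p, centroids)
        if dist > max_distance:
            outliers += 1
        assignments.append(idx)
    return assignments, outliers


def needs_full_recluster(outliers, total, max_outlier_fraction=0.2):
    """Heuristic trigger: too many new points fit no cluster well."""
    return total > 0 and outliers / total > max_outlier_fraction


# Centroids from the last full run (toy 2-D values, not real data).
centroids = [(0.0, 0.0), (10.0, 10.0)]
new_points = [(1.0, 1.0), (9.0, 9.5), (5.0, 5.0)]

assignments, outliers = assign_batch(new_points, centroids, max_distance=3.0)
recluster = needs_full_recluster(outliers, len(new_points))
```

In the toy data the third point is equally far from both centroids and beyond max_distance, so it is counted as an outlier.  In practice the vectors would come from whatever vectorization the full clustering job used, and the distance measure should match the one used there.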

> 
> Since I'm not a mahout or linear algebra expert at this point, ideally the
> solution would involve a combination of the existing mahout tasks.  That
> being said, I'd be appreciative of any and all advice.
> 
> Thanks,
> 
> Asif
> 
> 
> -- 
> Asif Rahman
> Lead Engineer - NewsCred
> asif@newscred.com
> http://platform.newscred.com

