mahout-user mailing list archives

From Gustavo Fernandes
Subject (Near) Realtime clustering
Date Tue, 23 Nov 2010 17:58:27 GMT
Hello, we have been tasked with implementing a system that clusters news articles in near real time.
We have a large number of articles (millions), and we started by using k-means to create clusters
based on a fixed value of "k". The problem is that we have a constant incoming flow of news
articles and cannot afford to rely on a batch process: we need to present clustered articles to
users as soon as they arrive in our database. So far our clusters are saved into
a SequenceFile, as normally output by the k-means driver.
What would be the recommended way of approaching this problem with Mahout? Is it possible
to manipulate the generated clusters and incrementally add new articles to them, or even form
new clusters, without incurring the penalty of recalculating for every vector again? Is starting
with k-means the right approach? What combination of algorithms would provide incremental
and fast clustering?
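
To make the question concrete, the kind of incremental update being asked about could be sketched roughly as follows. This is plain Python, not Mahout code: the names `nearest`, `add_article`, and `new_cluster_threshold`, and the in-memory lists of centroids and counts, are all hypothetical and stand in for whatever would be loaded from the existing k-means output. It assumes Euclidean distance and a running-mean centroid update, which is only one possible choice.

```python
import math


def nearest(centroids, v):
    """Return (index, distance) of the centroid closest to v (Euclidean).

    Returns (-1, inf) when there are no centroids yet.
    """
    best_i, best_d = -1, math.inf
    for i, c in enumerate(centroids):
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(c, v)))
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d


def add_article(centroids, counts, v, new_cluster_threshold):
    """Place one new vector without revisiting any earlier vector.

    If the nearest centroid is farther than new_cluster_threshold
    (a hypothetical tuning parameter), start a new cluster at v.
    Otherwise move that centroid toward v as a running mean.
    Returns the index of the cluster that received v.
    """
    i, d = nearest(centroids, v)
    if i < 0 or d > new_cluster_threshold:
        centroids.append(list(v))
        counts.append(1)
        return len(centroids) - 1
    counts[i] += 1
    n = counts[i]
    # Running mean: c_new = c_old + (v - c_old) / n
    centroids[i] = [c + (x - c) / n for c, x in zip(centroids[i], v)]
    return i
```

Each call touches only the k existing centroids, so the cost per article is independent of how many articles have already been clustered. Whether centroids updated this way stay close enough to what a full batch run would produce is part of what is being asked.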
