mahout-user mailing list archives

From Edoardo Tosca <>
Subject Re: (Near) Realtime clustering
Date Wed, 24 Nov 2010 17:15:43 GMT
Thank you,
I am trying to add new documents, but I'm stuck on an exception.
Basically I copied some code from KMeansDriver, and I call the
clusterDataSeq method.
I have seen that clusterDataSeq accepts a clusterIn Path parameter, which
should be the path that contains the already generated clusters.
Am I right?

When it tries to emitPointToNearestCluster, and in particular when it
calculates the distance, a CardinalityException is thrown.
What does it mean?

BTW, I'm creating the vectors from documents in a Lucene index.
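
For context, a CardinalityException in Mahout generally indicates that two
vectors taking part in an operation have different cardinalities (sizes),
for example when a document vector was built from a different dictionary
than the cluster centers it is compared against. The following is a minimal
plain-Java sketch of that kind of check, not Mahout's own code: the class
CardinalityCheck, the method squaredDistance, and the use of
IllegalArgumentException are all hypothetical stand-ins.

```java
// Hypothetical illustration only; this is not Mahout's implementation.
class CardinalityCheck {

    // Squared Euclidean distance that refuses vectors of different sizes,
    // mirroring the kind of check that leads to a cardinality error.
    static double squaredDistance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "Cardinality mismatch: " + a.length + " vs " + b.length);
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] center = {1.0, 0.0, 2.0};     // built with a 3-term dictionary
        double[] sameSpace = {1.0, 1.0, 2.0};  // same dictionary: accepted
        double[] otherSpace = {1.0, 1.0};      // different dictionary: rejected

        System.out.println(squaredDistance(center, sameSpace));
        try {
            squaredDistance(center, otherSpace);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If this is the cause, one thing to check is whether the new vectors and
the existing cluster centers were created with the same dictionary and
the same vector size.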

On Wed, Nov 24, 2010 at 5:00 PM, Jeff Eastman <> wrote:

> Note that the clustering drivers all have a static clusterData() method to
> run just the clustering (classification) of points. You would have to call
> this from your own driver as the current CLI does not offer just this
> option, but something like this should work:
> - Input documents are vectorized into sequence files which have timestamps
> so you know when to delete documents which have aged
> - Run full clustering over all remaining documents to produce clusters-n
> and clusteredPoints. This is the batch job over the entire corpus.
> - As new documents are received, use the clusterData() method to classify
> them using the previous clusters-n. This can be run using -xm sequential so
> it is all done in memory.
> - Periodically, add all the new documents to the corpus, delete any which
> have aged out of your time window, and start over
> -----Original Message-----
> From: Divya []
> Sent: Tuesday, November 23, 2010 6:32 PM
> To:
> Subject: RE: (Near) Realtime clustering
> Hi,
> I also have a similar requirement.
> Could someone please provide the steps of the hybrid approach?
> Regards,
> Divya
> -----Original Message-----
> From: Jeff Eastman []
> Sent: Wednesday, November 24, 2010 2:19 AM
> To:
> Subject: RE: (Near) Realtime clustering
> I'd suggest a hybrid approach: Run the batch clustering periodically over
> the entire corpus to update the cluster centers and then use those centers
> for real-time clustering (classification) of new documents as they arrive.
> You can use the sequential execution mode of the clustering job to classify
> documents in real time. The drawback is that new news topics
> will not materialize as new clusters until the batch job runs
> again.
> -----Original Message-----
> From: Gustavo Fernandes []
> Sent: Tuesday, November 23, 2010 9:58 AM
> To:
> Subject: (Near) Realtime clustering
> Hello, we have a mission to implement a system to cluster news articles in
> near real time. We have a large number of articles (millions), and we
> started using k-means to create clusters based on a fixed value of "k".
> The problem is that we have a constant incoming flow of news articles and
> we can't afford to rely on a batch process; we need to be able to present
> clustered articles to users as soon as they arrive in our database. So far
> our clusters are saved into a SequenceFile, as normally output by the
> k-means driver.
> What would be the recommended way of approaching this problem with Mahout?
> Is it possible to manipulate the generated clusters and incrementally add
> new articles to them, or even form new clusters, without incurring the
> penalty of recalculating for every vector again? Is starting with k-means
> the right way? What would be the right combination of algorithms to provide
> incremental and fast clustering calculation?
> TIA,
> Gustavo
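
The hybrid approach discussed in this thread (run batch clustering
periodically, then assign each newly arrived document to the nearest
existing center) can be sketched in plain Java as follows. This is an
illustrative sketch under stated assumptions, not Mahout code: the class
NearestCenterClassifier, its methods, and the choice of squared Euclidean
distance are hypothetical, and in practice the centers would be loaded from
the batch job's clusters-n output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory classifier; names and distance choice are assumptions.
class NearestCenterClassifier {
    private final List<double[]> centers = new ArrayList<>();

    // Swap in the centers produced by the latest batch clustering run.
    void setCenters(List<double[]> newCenters) {
        centers.clear();
        centers.addAll(newCenters);
    }

    // Return the index of the center closest to the document vector.
    int classify(double[] doc) {
        if (centers.isEmpty()) {
            throw new IllegalStateException("No cluster centers loaded");
        }
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centers.size(); i++) {
            double[] c = centers.get(i);
            if (c.length != doc.length) {
                throw new IllegalArgumentException("Vector sizes differ");
            }
            double dist = 0.0;
            for (int j = 0; j < doc.length; j++) {
                double d = doc[j] - c[j];
                dist += d * d;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NearestCenterClassifier classifier = new NearestCenterClassifier();
        // Toy centers standing in for the output of a batch run.
        classifier.setCenters(List.of(new double[]{0.0, 0.0},
                                      new double[]{10.0, 10.0}));
        // A newly arrived document is assigned without re-clustering.
        System.out.println(classifier.classify(new double[]{9.0, 8.0}));
    }
}
```

New documents only ever join existing clusters here, so genuinely new
topics still wait for the next batch run, which is the trade-off noted
earlier in the thread.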
