mahout-user mailing list archives

From Edoardo Tosca <e.to...@sourcesense.com>
Subject Re: (Near) Realtime clustering
Date Mon, 29 Nov 2010 15:17:42 GMT
Thank you for your tips.
We are no longer using a Lucene index; we now create sequence files from
text documents (as in the Reuters example).
While generating vectors we changed the weight option from TFIDF to TF,
but we still get a CardinalityException.
Any clues?

How can we set the size of input vectors?

Thank you in advance.
Edoardo
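
On setting the size of input vectors: a minimal sketch, in plain Python rather
than Mahout's own API, of one way to keep cardinalities equal. It assumes the
term dictionary from the batch run is kept and reused for every new document,
and that terms missing from it are simply dropped. The names build_dictionary
and tf_vector are hypothetical, made up for illustration.

```python
from collections import Counter


def build_dictionary(docs):
    """Map each distinct term in the batch corpus to a fixed index."""
    terms = sorted({t for doc in docs for t in doc.lower().split()})
    return {term: i for i, term in enumerate(terms)}


def tf_vector(doc, dictionary):
    """Build a dense TF vector whose length always equals len(dictionary).

    Terms that are not in the dictionary are dropped (an assumption of this
    sketch), so a new document can never change the cardinality.
    """
    vec = [0.0] * len(dictionary)
    for term, count in Counter(doc.lower().split()).items():
        idx = dictionary.get(term)
        if idx is not None:
            vec[idx] = float(count)
    return vec


# Dictionary built once, from the corpus the clusters were computed on.
corpus = ["stocks fall on rate fears", "team wins cup final"]
dictionary = build_dictionary(corpus)

# A later document is vectorized against that same dictionary.
new_doc = "stocks rally as rate fears ease"
vec = tf_vector(new_doc, dictionary)
```

Building a fresh dictionary from new_doc alone would give a vector of a
different length, which is the mismatch described in the quoted reply below.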


On Wed, Nov 24, 2010 at 5:28 PM, Jeff Eastman <jeastman@narus.com> wrote:

> It likely means that your cluster's cardinality is different from your
> input vector's cardinality. If your input vectors are term vectors computed
> from Lucene, then this could occur if a new term is introduced, increasing
> the size of the input vector. I can also see some problems if you are using
> seq2sparse for just the new vector, as that builds a new term dictionary.
> Also, TF-IDF wants to analyze the term frequencies over the entire corpus
> which won't work incrementally.
>
> I think you can fool the clustering by setting the sizes of your input
> vectors to be max_int but that won't help you with the other issues above.
> Our text processing algorithms will take some adjustments to handle this
> preprocessing correctly.
>
> -----Original Message-----
> From: Edoardo Tosca [mailto:e.tosca@sourcesense.com]
> Sent: Wednesday, November 24, 2010 9:16 AM
> To: user@mahout.apache.org
> Subject: Re: (Near) Realtime clustering
>
> Thank you,
> I am trying to add new documents, but I'm stuck on an exception.
> Basically I copied some code from KMeansDriver, and I call the
> clusterDataSeq method.
> I have seen that clusterDataSeq accepts a clusterIn Path parameter, which
> should be the path that contains the already generated clusters.
> Am I right?
>
> When it reaches emitPointToNearestCluster, and in particular when it
> calculates the distance, a CardinalityException is thrown.
> What does it mean?
>
> BTW I'm creating the vectors from documents in a Lucene index.
>
> On Wed, Nov 24, 2010 at 5:00 PM, Jeff Eastman <jeastman@narus.com> wrote:
>
> > Note that the clustering drivers all have a static clusterData() method
> > to run just the clustering (classification) of points. You would have to
> > call this from your own driver as the current CLI does not offer just this
> > option, but something like this should work:
> >
> > - Input documents are vectorized into sequence files which have
> > timestamps so you know when to delete documents which have aged
> > - Run full clustering over all remaining documents to produce clusters-n
> > and clusteredPoints. This is the batch job over the entire corpus.
> > - As new documents are received, use the clusterData() method to classify
> > them using the previous clusters-n. This can be run using -xm sequential
> > so it is all done in memory.
> > - Periodically, add all the new documents to the corpus, delete any which
> > have aged out of your time window, and start over
> >
> >
> >
> > -----Original Message-----
> > From: Divya [mailto:divya@k2associates.com.sg]
> > Sent: Tuesday, November 23, 2010 6:32 PM
> > To: user@mahout.apache.org
> > Subject: RE: (Near) Realtime clustering
> >
> > Hi,
> >
> > I have a similar requirement too.
> > Could someone please provide the steps of the hybrid approach?
> >
> >
> > Regards,
> > Divya
> >
> > -----Original Message-----
> > From: Jeff Eastman [mailto:jeastman@Narus.com]
> > Sent: Wednesday, November 24, 2010 2:19 AM
> > To: user@mahout.apache.org
> > Subject: RE: (Near) Realtime clustering
> >
> > I'd suggest a hybrid approach: Run the batch clustering periodically over
> > the entire corpus to update the cluster centers and then use those
> > centers for real-time clustering (classification) of new documents as they
> > arrive. You can use the sequential execution mode of the clustering job to
> > classify documents in real-time. This will suffer from the fact that new
> > news topics will not immediately materialize new clusters until the batch
> > job runs again.
> >
> > -----Original Message-----
> > From: Gustavo Fernandes [mailto:gustavonalle@gmail.com]
> > Sent: Tuesday, November 23, 2010 9:58 AM
> > To: user@mahout.apache.org
> > Subject: (Near) Realtime clustering
> >
> > Hello, we have a mission to implement a system to cluster news articles
> > in near real time. We have a large number of articles (millions), and we
> > started using k-means to create clusters based on a fixed value of "k".
> > The problem is that we have a constant incoming flow of news articles and
> > we can't afford to rely on a batch process; we need to be able to present
> > users with clustered articles as soon as they arrive in our database. So
> > far our clusters are saved into a SequenceFile, as normally output by the
> > k-means driver.
> > What would be the recommended way of approaching this problem with
> > Mahout? Is it possible to manipulate the generated clusters and
> > incrementally add new articles to them, or even form new clusters, without
> > incurring the penalty of recalculating for every vector again? Is starting
> > with k-means the right way? What would be the right combination of
> > algorithms to provide incremental and fast clustering calculation?
> >
> > TIA,
> > Gustavo
> >
> >
>
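
The classification step of the hybrid approach discussed in this thread can be
sketched as follows. This is plain Python, not Mahout's clusterData()
implementation: it only illustrates assigning each incoming vector to the
nearest center left over from the last batch run, and why a length mismatch has
to be rejected before a distance is computed. The names euclidean and
nearest_cluster, the cluster ids and the sample values are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance; refuses vectors of different cardinality."""
    if len(a) != len(b):
        raise ValueError(f"cardinality mismatch: {len(a)} != {len(b)}")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_cluster(point, centers):
    """Return the id of the center closest to the point."""
    return min(centers, key=lambda cid: euclidean(point, centers[cid]))


# Centers produced by the last batch run (values invented for illustration).
centers = {"c0": [1.0, 0.0, 0.0], "c1": [0.0, 1.0, 1.0]}

# New documents, already vectorized to the same cardinality as the centers.
incoming = [[0.9, 0.1, 0.0], [0.0, 0.8, 1.2]]

# The centers are left untouched; they only change when the batch job reruns.
assignments = [nearest_cluster(p, centers) for p in incoming]
```

Because the centers are not updated here, a new topic stays spread across the
existing clusters until the next batch run, which is the limitation noted in
the thread.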
