mahout-user mailing list archives

From Jure Jeseničnik <Jure.Jesenic...@planet9.si>
Subject RE: (Near) Realtime clustering
Date Thu, 02 Dec 2010 13:25:46 GMT
I am facing the same exception while trying to cluster vectors generated from two different
Lucene indexes. Could you please give some more information on how to resolve this?
I created both vectors with the Mahout lucene.vector command. A CardinalityException is
thrown while running canopy.

Thank you for your help.

Jure

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Monday, November 29, 2010 7:16 PM
To: user@mahout.apache.org
Subject: Re: (Near) Realtime clustering

Cardinality exceptions mean that feature vectors are variable size and
somehow vectors of
different sizes are being combined.  This problem is endemic with encodings
that have one
slot in the feature vector per vocabulary item in your input text.  The
Lucene vector extractor
knows what the total number of unique terms is so it can create vectors with
that dimension.
Lucene also allows for an easy way to assign terms to slots.

Unfortunately, if you are creating vectors on the fly, you don't necessarily
know what the total
vocabulary size is until you have seen the whole vocabulary.  In a parallel
environment, no
single machine sees the entire vocabulary so generally a pre-pass is
required to get the
corpus vocabulary.  The pre-pass generally is as expensive as an entire
clustering pass which
is a distinct grump.
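[To make the failure mode above concrete, here is a toy, self-contained sketch of the
cardinality check that any pairwise distance computation has to perform. This is not
Mahout's actual code; the class and method names are illustrative. Vectors built against
two separately-derived dictionaries end up with different lengths and trip exactly this
kind of check.]

```java
// Toy sketch (not Mahout's implementation) of why vectors built from two
// separately-derived dictionaries cannot be compared: their cardinalities
// differ, and the distance computation must reject the pair.
class CardinalityCheck {

    // Squared Euclidean distance; throws on mismatched cardinality,
    // analogous to Mahout's CardinalityException.
    static double squaredDistance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "cardinality mismatch: " + a.length + " vs " + b.length);
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }
}
```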

I have not experimented with it yet, but you might try using the feature
vector encoders like
org.apache.mahout.vectorizer.encoders.TextValueEncoder.  That will give you
a fixed size
vector without regard to your vocabulary size.  It should also preserve the
meaning of your
distance metrics as much as possible.  Since the vector size does not depend
on the vocabulary
this can also avoid the pre-pass.
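[The idea behind encoders like TextValueEncoder is feature hashing: choose the vector
dimension up front and hash each token to a slot, so the cardinality never depends on the
vocabulary. A minimal self-contained sketch of that idea follows; it is not Mahout's
actual encoder, which layers multiple hash probes and weighting on top of this.]

```java
import java.util.Locale;

// Toy sketch of hashed ("fixed-size") feature encoding: each token is mapped
// to a slot by hashing, so the vector dimension is chosen up front and never
// depends on how large the vocabulary turns out to be.
class HashedEncoder {
    private final int dimension;

    HashedEncoder(int dimension) {
        this.dimension = dimension;
    }

    // Encode a piece of text into a fixed-length double[].
    double[] encode(String text) {
        double[] vector = new double[dimension];
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            // Math.floorMod keeps the slot index non-negative even when
            // hashCode() is negative.
            int slot = Math.floorMod(token.hashCode(), dimension);
            vector[slot] += 1.0; // term-frequency-style weight
        }
        return vector;
    }
}
```

Because every document lands in the same fixed dimension, canopy or k-means distance
computations never see mismatched sizes, at the cost of occasional hash collisions.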

The defect is that vectors become somewhat harder to interpret.  There is
some provision for
decoding vectors using a trace dictionary, but using the trace dictionary
can kill your encoding
performance so it is preferable to only use the trace when you want to
interpret a vector.  The
ModelDissector class does some of this.  You can use that code as an
example.

I can go into more details if you would like.

On Mon, Nov 29, 2010 at 7:17 AM, Edoardo Tosca <e.tosca@sourcesense.com> wrote:

> Thank you for your tips,
> we are not using Lucene index anymore, we are creating sequence file from
> text documents (like the reuters example).
> While generating vectors we changed the weight option from TFIDF to TF.
> But we still have Cardinality Exception.
> Any clues?
>
> How can we set the size of input vectors?
>
> Thank you in advance.
> Edoardo
>
>
> On Wed, Nov 24, 2010 at 5:28 PM, Jeff Eastman <jeastman@narus.com> wrote:
>
> > It likely means that your cluster's cardinality is different from your
> > input vector's cardinality. If your input vectors are term vectors computed
> > from Lucene, then this could occur if a new term is introduced, increasing
> > the size of the input vector. I can also see some problems if you are using
> > seq2sparse for just the new vector, as that builds a new term dictionary.
> > Also, TF-IDF wants to analyze the term frequencies over the entire corpus,
> > which won't work incrementally.
> >
> > I think you can fool the clustering by setting the sizes of your input
> > vectors to be max_int, but that won't help you with the other issues above.
> > Our text processing algorithms will take some adjustments to handle this
> > preprocessing correctly.
> >
> > -----Original Message-----
> > From: Edoardo Tosca [mailto:e.tosca@sourcesense.com]
> > Sent: Wednesday, November 24, 2010 9:16 AM
> > To: user@mahout.apache.org
> > Subject: Re: (Near) Realtime clustering
> >
> > Thank you,
> > I am trying to add new documents, but I'm stuck on an exception.
> > Basically I copied some code from KMeansDriver, and I execute the
> > clusterDataSeq method.
> > I have seen that clusterDataSeq accepts a clusterIn Path parameter that
> > should be the path containing the already generated clusters.
> > Am I right?
> >
> > When it tries to emitPointToNearestCluster, and in particular when it
> > calculates the distance, a CardinalityException is thrown:
> > what does it mean?
> >
> > BTW I'm creating the vector getting documents from a Lucene index.
> >
> > On Wed, Nov 24, 2010 at 5:00 PM, Jeff Eastman <jeastman@narus.com>
> wrote:
> >
> > > Note that the clustering drivers all have a static clusterData() method
> > > to run just the clustering (classification) of points. You would have to
> > > call this from your own driver as the current CLI does not offer just
> > > this option, but something like this should work:
> > >
> > > - Input documents are vectorized into sequence files which have
> > > timestamps so you know when to delete documents which have aged
> > > - Run full clustering over all remaining documents to produce clusters-n
> > > and clusteredPoints. This is the batch job over the entire corpus.
> > > - As new documents are received, use the clusterData() method to classify
> > > them using the previous clusters-n. This can be run using -xm sequential
> > > so it is all done in memory.
> > > - Periodically, add all the new documents to the corpus, delete any which
> > > have aged out of your time window, and start over
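[The real-time half of the steps above boils down to nearest-center assignment, which is
what the classification pass does for each incoming point. A self-contained sketch of
that assignment, assuming dense vectors and squared Euclidean distance; this is
illustrative, not Mahout's actual clusterData() code.]

```java
// Sketch of real-time classification against batch-produced cluster centers:
// each new document vector is assigned to the index of the nearest center.
class NearestCenter {

    // Returns the index of the center closest to the point
    // (squared Euclidean distance; assumes matching cardinalities).
    static int nearest(double[] point, double[][] centers) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int k = 0; k < centers.length; k++) {
            double dist = 0.0;
            for (int i = 0; i < point.length; i++) {
                double d = point[i] - centers[k][i];
                dist += d * d;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = k;
            }
        }
        return best;
    }
}
```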
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Divya [mailto:divya@k2associates.com.sg]
> > > Sent: Tuesday, November 23, 2010 6:32 PM
> > > To: user@mahout.apache.org
> > > Subject: RE: (Near) Realtime clustering
> > >
> > > Hi,
> > >
> > > I also have a similar requirement.
> > > Can someone please provide me with the steps of the hybrid approach.
> > >
> > >
> > > Regards,
> > > Divya
> > >
> > > -----Original Message-----
> > > From: Jeff Eastman [mailto:jeastman@Narus.com]
> > > Sent: Wednesday, November 24, 2010 2:19 AM
> > > To: user@mahout.apache.org
> > > Subject: RE: (Near) Realtime clustering
> > >
> > > I'd suggest a hybrid approach: run the batch clustering periodically
> > > over the entire corpus to update the cluster centers, and then use those
> > > centers for real-time clustering (classification) of new documents as
> > > they arrive. You can use the sequential execution mode of the clustering
> > > job to classify documents in real-time. This will suffer from the fact
> > > that new news topics will not immediately materialize new clusters until
> > > the batch job runs again.
> > >
> > > -----Original Message-----
> > > From: Gustavo Fernandes [mailto:gustavonalle@gmail.com]
> > > Sent: Tuesday, November 23, 2010 9:58 AM
> > > To: user@mahout.apache.org
> > > Subject: (Near) Realtime clustering
> > >
> > > Hello, we have a mission to implement a system to cluster news articles
> > in
> > > near real time mode. We have a large amount of articles (millions), and
> > we
> > > started using k-means to create clusters based on a fixed value of "k".
> > > The problem is that we have a constant incoming flow of news articles and
> > > we can't afford to rely on a batch process; we need to be able to present
> > > users with clustered articles as soon as they arrive in our database. So
> > > far our clusters are saved into a SequenceFile, as normally output by the
> > > k-means driver.
> > > What would be the recommended way of approaching this problem with
> > > Mahout? Is it possible to manipulate the generated clusters and
> > > incrementally add new articles to them, or even to form new clusters
> > > without incurring the penalty of recalculating for every vector again? Is
> > > starting with k-means the right way? What would be the right combination
> > > of algorithms to provide incremental and fast clustering calculation?
> > >
> > > TIA,
> > > Gustavo
> > >
> > >
> >
>