mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: CardinalityException during data clustering
Date Fri, 27 May 2011 17:47:26 GMT
The text value encoder has a special set of methods so that you can add text
that it tokenizes for you.  That is generally the easiest method.

You can tokenize it yourself and use the addToVector method if you like.
 Sometimes that is preferable because you may have a non-Lucene tokenizer or
you may want to avoid double tokenization (or a hundred other reasons).
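
Roughly, both approaches look like the sketch below (written against the Mahout 0.5-era encoder API; the class name, field name, and cardinality are illustrative, and exact package locations or signatures may differ between releases):

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.TextValueEncoder;

public class HashedEncodingSketch {
  public static void main(String[] args) {
    // Fixed cardinality shared by every document vector (illustrative value).
    int cardinality = 1000;
    TextValueEncoder encoder = new TextValueEncoder("body");

    // 1) Let the encoder tokenize the raw text for you.
    Vector doc1 = new RandomAccessSparseVector(cardinality);
    encoder.addToVector("the quick brown fox", doc1);

    // 2) Tokenize yourself (custom analyzer, no double tokenization) and
    //    add each token with an explicit weight.
    Vector doc2 = new RandomAccessSparseVector(cardinality);
    for (String token : new String[] {"the", "quick", "brown", "fox"}) {
      encoder.addToVector(token, 1.0, doc2);
    }

    System.out.println(doc1);
    System.out.println(doc2);
  }
}

Either way, every document is encoded into the same fixed cardinality, which is what removes the need for a global dictionary.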

On Fri, May 27, 2011 at 8:49 AM, David Saile <david@uni-koblenz.de> wrote:

> I really appreciate your help, Ted!
>
> As I am new to Mahout, could you please point me in the right direction?
>
> From looking at the code, I get the impression that I would need to use the
> TextValueEncoder class and continuously call
> addToVector(String originalForm, double weight, Vector data)
> for each word in a given document. Is this correct?
>
>
> On 27.05.2011, at 17:26, Ted Dunning wrote:
>
> > You have to write or adapt some code.  This is the big current down-side of
> > the hashing encoders.
> >
> > On Fri, May 27, 2011 at 2:38 AM, David Saile <david@uni-koblenz.de> wrote:
> >
> >>> The other option is to use the hashing encoders.  They inherently produce
> >>> output of fixed cardinality.  The down-side with that is that the meaning
> >>> of lots of distance measures is hard to understand in the hashed frameworks.
> >>> Distances that are invariant under linear transformations work perfectly.
> >>> Some others like Manhattan distance work pretty well.  Others can be
> >>> totally confused.
> >>
> >> This sounds like an option that eliminates the need for a global dictionary
> >> (with regard to multiple vectorizer runs).
> >> How can I specify the use of hashing encoders for vectorization?
>
>
