mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: http://bixolabs.com/datasets/public-terabyte-dataset-project/
Date Wed, 04 Nov 2009 02:37:53 GMT
I would opt for the most specific tokenization that is feasible (no
stemming, as much compounding as possible).  The rationale for this is that
stemming and uncompounding can be added by linear transformations of the
matrix at any time.
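
A minimal sketch of that idea, assuming a small document-by-term count matrix built from the most specific tokens. The vocabulary, the counts, and the hand-written mapping below are hypothetical and are not taken from the dataset or from any Mahout API. Stemming and splitting compounds both become one multiplication by a term-mapping matrix:

```python
# Rows are documents, columns are the most specific terms (toy data, assumed).
specific_terms = ["runs", "running", "new_york", "york"]
doc_term = [
    [2, 1, 1, 0],  # document 0
    [0, 0, 1, 3],  # document 1
]

# Coarser vocabulary after stemming and splitting compounds (assumed).
coarse_terms = ["run", "new", "york"]

# mapping[i][j] == 1 when specific term i contributes to coarse term j.
# Stemming gives one 1 per row; a compound gives one 1 per component.
mapping = [
    [1, 0, 0],  # runs     -> run
    [1, 0, 0],  # running  -> run
    [0, 1, 1],  # new_york -> new + york
    [0, 0, 1],  # york     -> york
]


def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# The coarse matrix is a linear function of the specific one, so it can be
# derived later without re-tokenizing the text.
coarse = matmul(doc_term, mapping)
print(coarse)  # [[3, 1, 1], [0, 1, 4]]
```

The reverse direction is not available: the specific counts cannot be recovered from the coarse matrix, which is the argument for storing the specific tokenization.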

The only serious issue with this is the problem of overlapping compound
words.

On Tue, Nov 3, 2009 at 2:39 PM, Ken Krugler <kkrugler_lists@transpac.com> wrote:

> I assume there would also be an issue of which tokenizer to use to create
> the terms from the text.
>
> And possibly issues around storing separate vectors for (at least) title
> vs. content?
>
> Anybody have input on either of these?
>
>
