mahout-user mailing list archives

From: Ken Krugler <>
Subject: Re:
Date: Fri, 13 Nov 2009 18:35:14 GMT
Hi Ted,

On Nov 3, 2009, at 6:37pm, Ted Dunning wrote:

> I would opt for the most specific tokenization that is feasible (no
> stemming, as much compounding as possible).

By "as much compounding as possible", do you mean you want the  
tokenizer to do as much splitting as possible, or as little?

E.g. should "super-duper" be left as-is, or turned into "super" and "duper"?
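To make the two options concrete, here's a minimal plain-Java sketch of the choice (illustrative only, not actual Lucene classes):

```java
// Sketch of the two tokenization choices for a hyphenated compound.
// Plain Java for illustration; a real setup would use Lucene analyzers.
import java.util.Arrays;
import java.util.List;

public class CompoundTokenDemo {
    // Option 1: keep the compound as a single token.
    static List<String> keepCompound(String term) {
        return List.of(term);
    }

    // Option 2: split on the hyphen into sub-tokens.
    static List<String> splitCompound(String term) {
        return Arrays.asList(term.split("-"));
    }

    public static void main(String[] args) {
        System.out.println(keepCompound("super-duper"));  // [super-duper]
        System.out.println(splitCompound("super-duper")); // [super, duper]
    }
}
```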
Is there a particular configuration of Lucene tokenizers that you'd recommend?

-- Ken

> The rationale for this is that stemming and uncompounding can be added
> by linear transformations of the matrix at any time. The only serious
> issue with this is the problem of overlapping compound words.
>
> On Tue, Nov 3, 2009 at 2:39 PM, Ken Krugler <> wrote:
>> I assume there would also be an issue of which tokenizer to use to
>> create the terms from the text. And possibly issues around storing
>> separate vectors for (at least) title vs. content? Anybody have input
>> on either of these?
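Ted's point that stemming can be layered on later as a linear transformation can be sketched like this (the terms, stem map, and counts below are made-up illustration data, not from Mahout):

```java
// Sketch: applying stemming after the fact as a linear map S on a
// term-document count matrix X. Rows of S are stems, columns are
// surface terms; S[i][j] = 1 when term j stems to stem i.
public class StemMergeDemo {
    // rows: terms {run, running, cat}; cols: 2 documents
    static final double[][] X = {
        {1, 0},   // "run"
        {2, 1},   // "running"
        {0, 3},   // "cat"
    };
    // run <- {run, running}; cat <- {cat}
    static final double[][] S = {
        {1, 1, 0},
        {0, 0, 1},
    };

    static double[][] multiply(double[][] a, double[][] b) {
        double[][] c = new double[a.length][b[0].length];
        for (int i = 0; i < a.length; i++)
            for (int k = 0; k < b.length; k++)
                for (int j = 0; j < b[0].length; j++)
                    c[i][j] += a[i][k] * b[k][j];
        return c;
    }

    public static void main(String[] args) {
        // S * X merges the "run" and "running" rows: {{3, 1}, {0, 3}}
        double[][] stemmed = multiply(S, X);
        System.out.println(java.util.Arrays.deepToString(stemmed));
    }
}
```

The same trick works for uncompounding (a term row fans out to several compound-part rows), which is why the tokenization itself can stay maximally specific.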

Ken Krugler
+1 530-210-6378
e l a s t i c   w e b   m i n i n g
