mahout-user mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: http://bixolabs.com/datasets/public-terabyte-dataset-project/
Date Fri, 13 Nov 2009 18:35:14 GMT
Hi Ted,

On Nov 3, 2009, at 6:37pm, Ted Dunning wrote:

> I would opt for the most specific tokenization that is feasible (no
> stemming, as much compounding as possible).

By "as much compounding as possible", do you mean you want the
tokenizer to do as much splitting as possible, or as little?

E.g. should "super-duper" be left as-is, or turned into "super" and
"duper"?
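For concreteness, here is a sketch of the two behaviors in plain Java (a simple whitespace/hyphen split, not any particular Lucene analyzer; the class and method names are just illustrative):

```java
import java.util.Arrays;
import java.util.List;

public class CompoundTokenizerSketch {

    // Option 1: split on whitespace only, so hyphenated
    // compounds like "super-duper" stay as one token.
    static List<String> keepCompounds(String text) {
        return Arrays.asList(text.toLowerCase().split("\\s+"));
    }

    // Option 2: split on whitespace and hyphens, so
    // "super-duper" becomes two tokens, "super" and "duper".
    static List<String> splitCompounds(String text) {
        return Arrays.asList(text.toLowerCase().split("[\\s-]+"));
    }

    public static void main(String[] args) {
        String text = "a super-duper crawler";
        System.out.println(keepCompounds(text));  // [a, super-duper, crawler]
        System.out.println(splitCompounds(text)); // [a, super, duper, crawler]
    }
}
```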

Is there a particular configuration of Lucene tokenizers that you'd
suggest?

Thanks,

-- Ken


> The rationale for this is that stemming and uncompounding can be
> added by linear transformations of the matrix at any time.
>
> The only serious issue with this is the problem of overlapping
> compound words.
>
> On Tue, Nov 3, 2009 at 2:39 PM, Ken Krugler <kkrugler_lists@transpac.com> wrote:
>
>> I assume there would also be an issue of which tokenizer to use to
>> create the terms from the text.
>>
>> And possibly issues around storing separate vectors for (at least)
>> title vs. content?
>>
>> Anybody have input on either of these?
>>

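Ted's point that stemming can be added later as a linear transformation can be sketched concretely: if A is the term-document count matrix, stemming amounts to multiplying by a merge matrix S whose rows sum the counts of terms sharing a stem. The code below is an illustrative toy (hypothetical names, hand-built matrices), not anything from Mahout:

```java
public class StemMergeSketch {

    // Plain matrix product: r[i][k] = sum_j s[i][j] * a[j][k].
    static double[][] multiply(double[][] s, double[][] a) {
        double[][] r = new double[s.length][a[0].length];
        for (int i = 0; i < s.length; i++)
            for (int j = 0; j < a.length; j++)
                for (int k = 0; k < a[0].length; k++)
                    r[i][k] += s[i][j] * a[j][k];
        return r;
    }

    public static void main(String[] args) {
        // Term-document counts for terms {run, running, cat} in 2 docs.
        double[][] a = { {1, 0},
                         {2, 1},
                         {0, 3} };
        // Merge matrix: rows are stems {run, cat}; a 1 in column j
        // means term j maps to that stem ("run" and "running" merge).
        double[][] s = { {1, 1, 0},
                         {0, 0, 1} };
        double[][] stemmed = multiply(s, a); // {{3, 1}, {0, 3}}
        System.out.println(java.util.Arrays.deepToString(stemmed));
    }
}
```

So an unstemmed matrix loses nothing: any stemming (or uncompounding) scheme chosen later is just a choice of S, which is why the most specific tokenization is the safest thing to store.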
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




