mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Clustering Demo
Date Sat, 24 May 2008 11:13:16 GMT

On May 23, 2008, at 2:15 PM, Karl Wettin wrote:

>
> On 17 May 2008, at 13:39, Grant Ingersoll wrote:
>>
>> On May 12, 2008, at 11:24 AM, Karl Wettin wrote:
>
>
> Did anyone do anything with this? If not, I'll come up with something
> in the beginning of June. I think it should be abstract enough to
> handle other similar data sources (Apache mbox archives).


This would be cool.

>
>
>
>>> In what way can we prepare so it makes as much sense for as many
>>> things as possible we might want to show off? What class fields
>>> can we extract from the headers except for author and thread
>>> identity? How do we want to tokenize the text (grams of words and
>>> characters, stemming, stopwords, etc.)? Do we want to separate
>>> quotation from author text so we can use different weights for
>>> quotation, etc.?
>>
>> Let's just start simple with words and then enhance.
>
>
> It might be interesting to take a look at what sort of tokenization
> other libs do, the Weka StringToWordVector for instance (best viewed
> from their GUI). We should be able to do much better than that with
> what's available in Lucene. But a default chain of token streams that
> is easy to set up is not a bad idea.
>
> I also think we want some simple algorithmic stop word extraction.  
> There is a simple one in LUCENE-1025 with the incorrect name  
> HacGqfTermReducer.java.
>
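
As a rough illustration of what "simple algorithmic stop word extraction" could mean, here is a minimal sketch that flags any term occurring in more than a given fraction of the documents. It is not the LUCENE-1025 code, whose approach is not described in this thread; the class name, method name, threshold, and the crude regex word splitter are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucene or Mahout.
final class StopWordSketch {

    /**
     * Returns the terms that occur in more than maxDocFraction of the
     * documents. Document frequency is used rather than raw term count,
     * so one very long message cannot dominate the result.
     */
    static Set<String> frequentTerms(List<String> docs, double maxDocFraction) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            // Collect each distinct word once per document.
            Set<String> seen = new HashSet<>();
            for (String tok : doc.toLowerCase().split("[^\\p{L}\\p{Nd}]+")) {
                if (!tok.isEmpty()) {
                    seen.add(tok);
                }
            }
            for (String tok : seen) {
                docFreq.merge(tok, 1, Integer::sum);
            }
        }
        // Keep only the terms above the document-frequency threshold.
        Set<String> stopWords = new TreeSet<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            if (e.getValue() > maxDocFraction * docs.size()) {
                stopWords.add(e.getKey());
            }
        }
        return stopWords;
    }
}
```

A call such as StopWordSketch.frequentTerms(messages, 0.5) would treat any word found in more than half of the messages as a stop word; a sensible threshold probably depends on the corpus.
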
> It would be a simple thing to support different weights for subject
> and body. Or any other field we might extract in the future (quoted
> body, etc.).
>
> We also want to get rid of signatures with quotes and whatnot in
> them. That should be handled by some pre-pre-processing layer though,
> if you ask me. LUCENE-725 can help out.
>
>
> Should we perhaps make this thread an issue?
>

These are interesting. Perhaps you want to commit LUCENE-725?  I was
wondering whether we should consider asking Lucene to put up an
Analyzer-only jar (i.e. a separate jar that combines the
Analyzer/TokenStream definitions with the contrib Analyzers package.)  Of
course, we may have uses for the rest of Lucene as well, so maybe not.
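
To make the "start simple with words" idea and the per-field weights discussed above concrete, here is a minimal sketch in plain Java with no Lucene dependency. It assumes that quoted lines start with ">" and that a signature begins at the conventional "-- " separator line; the class name, method names, and weight parameters are hypothetical, and the regex splitter only stands in for a real TokenStream chain.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene or Mahout.
final class MailTermWeights {

    /** Adds every word in text to counts, scaled by the given weight. */
    static void addTerms(Map<String, Double> counts, String text, double weight) {
        for (String tok : text.toLowerCase().split("[^\\p{L}\\p{Nd}]+")) {
            if (!tok.isEmpty()) {
                counts.merge(tok, weight, Double::sum);
            }
        }
    }

    /**
     * Builds weighted term counts for one message. Subject, author text,
     * and quoted text each get their own weight; everything after the
     * signature separator is dropped.
     */
    static Map<String, Double> weigh(String subject, String body,
                                     double subjectWeight,
                                     double authorWeight,
                                     double quoteWeight) {
        Map<String, Double> counts = new TreeMap<>();
        addTerms(counts, subject, subjectWeight);
        for (String line : body.split("\\r?\\n")) {
            // Assumption: "-- " (or "--") alone on a line starts the signature.
            if (line.equals("-- ") || line.equals("--")) {
                break;
            }
            String trimmed = line.trim();
            if (trimmed.startsWith(">")) {
                // Strip any depth of quote markers before tokenizing.
                addTerms(counts, trimmed.replaceFirst("^[>\\s]+", ""), quoteWeight);
            } else {
                addTerms(counts, trimmed, authorWeight);
            }
        }
        return counts;
    }
}
```

For example, MailTermWeights.weigh(subject, body, 2.0, 1.0, 0.5) would count subject words double and quoted words half. Attribution lines such as "On May 23, ... wrote:" are still treated as author text here, which a real pre-processing layer would likely want to handle.
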
