lucene-java-user mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: Bad behaviors of FrenchAnalyzer
Date Tue, 11 Oct 2005 16:20:51 GMT

On Oct 11, 2005, at 10:52 AM, Hugo Lafayette wrote:
> Erik Hatcher wrote:
>
>
>> Rather than changing StandardAnalyzer, you could create a custom
>> Analyzer that is something along the lines of StandardTokenizer ->
>> custom apostrophe splitting filter -> ISOLatin1AccentFilter.
>>
>
> Why not include that in the FrenchStemFilter "next()" method itself?
> Would that be bad design?

I've not personally used the FrenchStemFilter, so I cannot comment on  
its behavior at all.  I'm out of my league in that realm.

> And I'm quite concerned about performance, but it seems to me that
> your solution will only affect "APOSTROPHE"-typed tokens, so the
> overhead will be nonexistent, right?

There is little need to be concerned with analyzer performance, at
least at this stage.  First have a problem, then optimize for it.  I
don't speculate about performance.  But yes, only the apostrophe type
(whatever that is; I'm not looking at the code now, but I think it's
"<APOSTROPHE>", with angle brackets) would need to be caught and
split; the rest could pass straight through.  Again, look at
StandardFilter for an example - it strips the trailing 's from tokens
of that type.
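
As a rough sketch of the splitting step on its own, independent of
Lucene: the helper below cuts a term at its first interior apostrophe
and passes anything else through untouched.  The class and method names
are hypothetical, made up for illustration, and the rule itself (always
split at the first apostrophe) is an assumption; a real French filter
would probably want exceptions for words such as "aujourd'hui".

```java
// Hypothetical helper, not part of Lucene: splits a term at its first
// interior apostrophe, e.g. "l'avion" becomes "l" and "avion".
final class ApostropheSplitter {
    private ApostropheSplitter() {}

    static String[] split(String term) {
        int pos = term.indexOf('\'');
        // No apostrophe, or one at the very start or end: pass through as-is.
        if (pos <= 0 || pos == term.length() - 1) {
            return new String[] { term };
        }
        // Assumption: the apostrophe itself is dropped from both pieces.
        return new String[] { term.substring(0, pos), term.substring(pos + 1) };
    }
}
```

Whether the elided article ("l", "d", "qu") is kept as its own token or
discarded afterwards is a separate decision, for example via a stop list
further down the chain.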

>> You get a special type for words with interior apostrophes from
>> StandardTokenizer (look at StandardFilter to see how that works). You
>> could create a simple TokenFilter that splits apostrophe'd tokens
>> into two.
>>
>
> I'm not sure how to do that efficiently. Is it something like
> this?
>
> <code>
>
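> private Stack subTokens; // previously initialized
>
> public final Token next() throws IOException {
>   // Emit pieces left over from a previous split first.
>   if (subTokens != null && !subTokens.empty()) {
>     return (Token) subTokens.pop();
>   }
>   Token t = input.next();
>   // Compare with equals(), not ==, so the check does not rely on
>   // the type strings being the same object.
>   if (t != null && APOSTROPHE_TYPE.equals(t.type())) {
>     // tokenizeApostrophe must push the pieces in reverse order so
>     // that they pop in text order.
>     tokenizeApostrophe(t, subTokens);
>     if (!subTokens.empty()) {
>       // Return the first piece instead of the unsplit token.
>       t = (Token) subTokens.pop();
>     }
>   }
>   return t;
> }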
>
> </code>
>
> where "tokenizeApostrophe(Token, Stack)" conditionally splits the
> token into two others and pushes them onto the stack.

Using a stack (or only a single spare Token if you will only split
into two pieces) is a good approach.  I haven't tried your code, but
I recommend writing some unit tests that exercise your filter
separately and ensure it works to split tokens as you expect.  :)
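
To make the single-spare-Token idea concrete, here is a minimal sketch
that runs without Lucene on the classpath.  Tok and TokSource are
stand-ins I made up for Lucene's Token and TokenStream, the
"<APOSTROPHE>" label is my guess at the type string, and analyze() is
just a hypothetical helper for driving the filter from a test.  Treat
it as an illustration of the buffering, not as working Lucene code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Stand-ins for Lucene's Token and TokenStream so the sketch runs on its own.
final class Tok {
    final String text;
    final String type;
    Tok(String text, String type) { this.text = text; this.type = type; }
}

interface TokSource {
    Tok next(); // returns null at end of stream
}

// Splits apostrophe-typed tokens in two, holding the second half in a
// single spare slot until the following call to next().
final class ApostropheSplitFilter implements TokSource {
    // Assumed type label; check the tokenizer's actual constant.
    static final String APOSTROPHE_TYPE = "<APOSTROPHE>";

    private final TokSource input;
    private Tok spare; // second half of the last split, if any

    ApostropheSplitFilter(TokSource input) { this.input = input; }

    public Tok next() {
        if (spare != null) {
            Tok t = spare;
            spare = null;
            return t;
        }
        Tok t = input.next();
        if (t == null || !APOSTROPHE_TYPE.equals(t.type)) {
            return t; // everything else passes straight through
        }
        int pos = t.text.indexOf('\'');
        if (pos <= 0 || pos == t.text.length() - 1) {
            return t; // nothing sensible to split
        }
        // Both halves keep the original type here, purely for simplicity.
        spare = new Tok(t.text.substring(pos + 1), t.type);
        return new Tok(t.text.substring(0, pos), t.type);
    }

    // Test helper: runs the filter over the given tokens and collects the terms.
    static List<String> analyze(Tok... tokens) {
        final Iterator<Tok> it = Arrays.asList(tokens).iterator();
        TokSource source = new TokSource() {
            public Tok next() { return it.hasNext() ? it.next() : null; }
        };
        ApostropheSplitFilter filter = new ApostropheSplitFilter(source);
        List<String> out = new ArrayList<String>();
        for (Tok t = filter.next(); t != null; t = filter.next()) {
            out.add(t.text);
        }
        return out;
    }
}
```

A unit test can then feed a short token sequence through analyze() and
compare the resulting terms against the expected list, including a
non-apostrophe token to confirm it passes through unchanged.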

     Erik

