lucene-java-user mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: Bad behaviors of FrenchAnalyzer
Date Tue, 11 Oct 2005 14:13:34 GMT

On Oct 11, 2005, at 9:22 AM, Hugo Lafayette wrote:
> - Accented characters: the French analyzer keeps accents, which could
> be useful but can also be a nuisance. I just have to add
> ISOLatinFilter.java to correct that, but an option to keep or strip
> them could be useful.
>
> - Apostrophe (') characters: the standard analyzer does NOT tokenize on
> ('), because of words like O'Reilly. But in French, many expressions
> must be split, such as "j'aime" or "l'amour", which contain two tokens
> each ("je" & "aime", "le" & "amour"). I'm quite surprised that nobody
> else has noticed this suspicious behavior before, so maybe I missed
> something.

Rather than changing StandardAnalyzer, you could create a custom
Analyzer along the lines of StandardTokenizer -> custom
apostrophe-splitting filter -> ISOLatinFilter. StandardTokenizer assigns
a special token type to words with interior apostrophes (look at
StandardFilter to see how that works), so you could create a simple
TokenFilter that splits apostrophe'd tokens into two. Maybe it's also
simple enough to expand "j" and "l" into "je" and "le" in the same step?

     Erik
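
The splitting step described above can be sketched without any Lucene
dependency. This is only an illustration of the logic such a filter
might apply to each token's text, not Lucene API: the class name
ApostropheSplitter, the method split, and the two-entry expansion table
are all hypothetical. Wiring it into a real TokenFilter, and checking
the apostrophe token type that StandardTokenizer assigns, is left out.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; not part of Lucene.
class ApostropheSplitter {
    // Assumption: only the two expansions mentioned in the thread.
    // Note that "l'" can also stand for "la", so this mapping is lossy.
    private static final Map<String, String> ELISIONS =
            Map.of("j", "je", "l", "le");

    // Splits a token at its first interior apostrophe and expands a
    // known elided prefix. Tokens without an interior apostrophe are
    // returned unchanged as a single-element list.
    static List<String> split(String token) {
        int i = token.indexOf('\'');
        if (i <= 0 || i == token.length() - 1) {
            return List.of(token);
        }
        String head = token.substring(0, i);
        String tail = token.substring(i + 1);
        String expanded =
                ELISIONS.getOrDefault(head.toLowerCase(Locale.ROOT), head);
        return List.of(expanded, tail);
    }

    public static void main(String[] args) {
        System.out.println(split("j'aime"));
        System.out.println(split("l'amour"));
    }
}
```

As written, a name such as O'Reilly is also split (into "O" and
"Reilly"), which matches the "split every apostrophe'd token" idea. To
leave such names intact, one option is to split only when the prefix
appears in the expansion table.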


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

