lucene-java-user mailing list archives

From Hugo Lafayette <hugo.lafaye...@temis-group.com>
Subject Bad behaviors of FrenchAnalyzer
Date Tue, 11 Oct 2005 13:22:36 GMT
Hi there,

I just tested the French analyzer, which works well for the most part
(the stemmer in particular). But at the moment I see two unexpected
behaviors with the default configuration:

- accented characters: The French analyzer keeps accents, which can be
useful but may also become annoying. I only have to add the
ISOLatinFilter.java to correct that, but an option to keep them or not
could be useful.

- apostrophe (') characters: The standard analyzer does NOT tokenize on
('), because of words like O'Reilly. But in French, lots of expressions
must be split, like "j'aime" or "l'amour", which each contain 2 tokens
("je" & "aime", "le" & "amour"). I'm quite surprised that nobody else
found this suspicious behavior before, so maybe I missed something.
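
For the accent point above, here is a minimal sketch of what accent
folding amounts to, written against the plain JDK rather than Lucene.
The class and method names (AccentFolding, stripAccents) are
hypothetical and are not the filter mentioned above; the sketch only
shows the idea of decomposing characters and dropping the diacritics.

```java
import java.text.Normalizer;

class AccentFolding {
    // Hypothetical helper, not a Lucene class: decompose each character
    // into base letter + combining marks (NFD), then drop the marks.
    static String stripAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // The input is "eleve" written with e-acute and e-grave,
        // spelled with Unicode escapes to avoid source-encoding issues.
        System.out.println(stripAccents("\u00e9l\u00e8ve")); // prints "eleve"
    }
}
```

Note that this approach does not expand ligatures such as the French
"oe" ligature, since NFD leaves them as they are; those would need an
explicit mapping.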

Anyway, I don't know how to proceed, since I have to index both English
and French text.

The simple way would be to change the standard analyzer grammar
(basically removing the APOSTROPHE rules) to get 2 tokens. But I'm
afraid of unexpected side effects.

The other way would be to make the French analyzer further tokenize
"j'aime" into 2 sub-tokens (with a token buffer, right?). Is it the
right thing to do? Does this represent a bug that will be corrected
soon? Is there another way around?
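
A minimal sketch of the second approach, assuming the split happens
after the standard tokenizer has produced a token such as "j'aime". It
uses only the JDK, not the Lucene TokenFilter API; the class name
ElisionSplitter and the prefix list are assumptions for illustration,
and the list is not exhaustive. A token is split only when the part
before the apostrophe is a known elided French form, so "O'Reilly"
stays in one piece.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class ElisionSplitter {
    // Assumed set of elided French prefixes; illustrative, not complete.
    static final Set<String> ELIDED = new HashSet<>(
            Arrays.asList("l", "d", "j", "m", "t", "s", "n", "c", "qu"));

    // Returns two sub-tokens when the token starts with a known elided
    // prefix followed by an apostrophe, otherwise the token unchanged.
    static List<String> split(String token) {
        List<String> out = new ArrayList<>();
        int pos = token.indexOf('\'');
        boolean hasBothSides = pos > 0 && pos < token.length() - 1;
        if (hasBothSides && ELIDED.contains(
                token.substring(0, pos).toLowerCase(Locale.ROOT))) {
            out.add(token.substring(0, pos));   // elided prefix, e.g. "j"
            out.add(token.substring(pos + 1));  // remainder, e.g. "aime"
        } else {
            out.add(token);                     // e.g. "O'Reilly" is kept
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("j'aime"));    // prints [j, aime]
        System.out.println(split("l'amour"));   // prints [l, amour]
        System.out.println(split("O'Reilly"));  // prints [O'Reilly]
    }
}
```

The sketch emits the elided form ("j", "l") rather than "je" or "le",
because "l'" alone does not say whether "le" or "la" was meant. In mixed
English and French text, names such as "D'Angelo" would also be split,
which is the kind of side effect to weigh against changing the grammar.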

Thanks in advance for your answers, and congrats on your delightful
software!


PS: I'm working with the "lucene-1.9-rc1-dev" version from the svn repository.

-- 
Hugo
