lucene-java-user mailing list archives

From Hugo Lafayette <hugo.lafaye...@temis-group.com>
Subject Re: Bad behaviors of FrenchAnalyzer
Date Tue, 11 Oct 2005 17:04:01 GMT
Marvin Humphrey wrote:

> I'm curious: are there any cases in French where a string with an  
> apostrophe in it ought to be split into two searchable tokens?  I  
> know of no such cases in English: you never want to search for the ll  
> in you'll, or the O in O'Reilly, etc.

First of all, and maybe I am making a false assumption here, but if you
strip leading "j'", "t'" and so on, then a search like:

 +text:"il m'aime"

will return documents containing the sentence "il m'aime" (French for
"he loves me") and also documents containing the sentence "il t'aime"
(French for "he loves you"), which is wrong, right?

So if this is correct, this is why I need to index both "m" and "aime"
as distinct tokens.
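
To make the concern concrete, here is a minimal sketch in plain Java. It
does not use Lucene's analyzers at all; the class and method names
(ElisionDemo, stripElision, splitOnApostrophe, containsPhrase) are
hypothetical, and treating "one letter before the apostrophe" as an
elision is my simplifying assumption. It only compares the two
tokenization strategies on an exact phrase match.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustration only: no Lucene classes are used and all names are hypothetical.
final class ElisionDemo {

    // Strategy A: drop a one-letter elided prefix ("m'", "t'", "j'", ...).
    // Assumption: a single letter before the apostrophe marks an elision.
    static List<String> stripElision(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) continue;
            String token = word.indexOf('\'') == 1 ? word.substring(2) : word;
            if (!token.isEmpty()) tokens.add(token);
        }
        return tokens;
    }

    // Strategy B: keep both sides of the apostrophe as separate tokens.
    static List<String> splitOnApostrophe(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            for (String part : word.split("'")) {
                if (!part.isEmpty()) tokens.add(part);
            }
        }
        return tokens;
    }

    // Exact phrase match: the query tokens must appear consecutively in the document.
    static boolean containsPhrase(List<String> doc, List<String> query) {
        return Collections.indexOfSubList(doc, query) >= 0;
    }

    public static void main(String[] args) {
        String doc = "il t'aime";
        String query = "il m'aime";
        // Strategy A reduces both to [il, aime], so the wrong document matches.
        System.out.println(containsPhrase(stripElision(doc), stripElision(query)));           // true
        // Strategy B keeps [il, t, aime] vs [il, m, aime], so it does not match.
        System.out.println(containsPhrase(splitOnApostrophe(doc), splitOnApostrophe(query))); // false
    }
}
```

With the prefix stripped, "il t'aime" and "il m'aime" become the same
token sequence; with both parts kept, the phrase query can tell them
apart.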

And I guess this is why "O'Reilly" is not split by the
StandardAnalyzer, since you don't want to find the documents containing
"N'Reilly".

More generally: I am a native French speaker, but I am not sure there
are any cases where a string containing an apostrophe has to be split
into two (real) searchable tokens. I know the word "aujourd'hui"
(French for "today"), but it is really a complete word in its own
right and does not need to be split further.

If this is important to you, I could look into it further and ask some
French linguists for help.

-- 
Hugo
