lucene-java-user mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: Bad behaviors of FrenchAnalyzer
Date Tue, 11 Oct 2005 18:00:31 GMT

On Oct 11, 2005, at 10:04 AM, Hugo Lafayette wrote:

> First of all, and maybe I make a false assumption here, but if you
> strip leading "j'", "t'" and so on, that means that if you make a
> search like:
>
>  +text:"il m'aime"
>
> you will get documents with the sentence "il m'aime" (French for "he
> loves me") and documents with the sentence "il t'aime" (French for "he
> loves you"), which is wrong, right?

I don't speak French, and I can't tell you whether  
Lingua::Stem::Snowball strips m' and t' -- the docs say "This method  
strips 's (english) and l', d', ... (french)."

That's a compelling example you have there, though, so I would hope  
not.  Conceptually, I would want the search to focus on the  
relatively rare word for "love" rather than on the pronouns.   
However, if the stemmer strips the pronouns, "m'aime" and "t'aime"  
would be conflated, which is as you say, "wrong". :)  Is "aime" ever  
used in isolation, or is it always hitched to a pronoun?
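
To make the conflation concrete, here is a minimal Python sketch. It
is not Lingua::Stem::Snowball or FrenchAnalyzer code: the prefix list
and the names strip_elision and analyze are assumptions made up for
illustration, and it only shows what would happen if an analyzer did
strip pronoun elisions.

```python
import re

# Hypothetical list of elided French prefixes (an assumption for this
# sketch, not taken from any real stemmer or analyzer).
ELISIONS = ("j", "t", "m", "l", "d", "s", "n", "c", "qu")
ELISION_RE = re.compile(r"^(?:%s)'" % "|".join(ELISIONS))


def strip_elision(token):
    """Drop a leading elision such as m' or t' from a single token."""
    return ELISION_RE.sub("", token)


def analyze(text):
    """Lowercase, split on whitespace, and strip elisions from each token."""
    return [strip_elision(tok) for tok in text.lower().split()]


print(analyze("il m'aime"))  # ['il', 'aime']
print(analyze("il t'aime"))  # ['il', 'aime']
```

Both sentences reduce to the same token list, so a phrase query for
one would match the other. Tokens such as "aujourd'hui" and "O'Reilly"
pass through unchanged, because their first segment is not in the
assumed prefix list.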

> So if this is correct, this is why I need to index both "m" and "aime"
> as distinct tokens.
>
> And I guess this is why "O'Reilly" is not split by the
> StandardAnalyzer, since you don't want to find the documents
> containing "N'Reilly".

Actually, the reason is that you wouldn't want to conflate searches  
for "Reilly" and "O'Reilly".  Further processing of a token falls  
under the rubric of stemming.

> More generally, I am a native French speaker, but I'm not sure there
> are cases where a string with an apostrophe has to be split into two
> (real) searchable tokens. I know the word "aujourd'hui" (French for
> "today"), but it's likely a complete word by itself which does not
> need to be split again.

So you wouldn't need a search for "aujourd" or "hui" to turn up  
documents which contain "aujourd'hui"?  Very good.

But then, what about "t'aime"?  If a search for "aime" should match
documents which contain "t'aime", then that's our problematic
example.  You wouldn't care about searching for a pronoun -- EXCEPT
when trying to match a phrase.  If that's the case, then the
StandardTokenizer may in fact be inadequate for French: "t'aime"
should be broken up into two tokens, "t" and "aime", so that the
pronoun is kept for phrase matching while "aime" remains searchable
on its own.
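
A rough Python sketch of that two-token approach follows. It is an
illustration only, not StandardTokenizer or any Lucene API: the ELIDED
set and the names tokenize and contains_phrase are hypothetical, and
it only handles the ASCII apostrophe.

```python
import re

# Hypothetical set of elided prefixes that should become their own token.
ELIDED = {"j", "t", "m", "l", "d", "s", "n", "c", "qu"}

# A word is a run of letters, optionally joined by ASCII apostrophes.
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def tokenize(text):
    """Split "t'aime" into "t" and "aime"; keep other apostrophic words whole."""
    tokens = []
    for word in WORD_RE.findall(text.lower()):
        head, sep, tail = word.partition("'")
        if sep and head in ELIDED:
            tokens.extend([head, tail])
        else:
            tokens.append(word)
    return tokens


def contains_phrase(doc, phrase):
    """True if the phrase's tokens occur as a contiguous run in the doc."""
    d, p = tokenize(doc), tokenize(phrase)
    return any(d[i:i + len(p)] == p for i in range(len(d) - len(p) + 1))


print(tokenize("Il t'aime"))                       # ['il', 't', 'aime']
print(contains_phrase("Il t'aime", "aime"))        # True
print(contains_phrase("Il t'aime", "il m'aime"))   # False
```

Under these assumptions a search for "aime" still finds "t'aime", the
phrase "il m'aime" no longer matches "il t'aime", and "aujourd'hui"
and "O'Reilly" stay single tokens because their first segment is not
in the ELIDED set.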

> If this is important to you, I could look further, and ask some
> French linguists for help.

I'm asking because a new version of my own search engine library has  
a default tokenizer which keeps apostrophic strings together (like  
StandardTokenizer), and I want to be aware of cases where this choice  
causes problems.  However, it's unlikely I'll change that behavior,  
as the problem is addressed by making it trivially easy to customize  
the tokenizer.  So I would say that for my own purposes, consulting a  
linguist is probably overkill.

Cheers,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

