lucene-java-user mailing list archives

From Marvin Humphrey
Subject Re: Bad behaviors of FrenchAnalyzer
Date Tue, 11 Oct 2005 16:47:02 GMT

On Oct 11, 2005, at 7:52 AM, Hugo Lafayette wrote:

> Why not include that in the FrenchStemFilter "next()" method
> itself?
> Would that be bad design?

I agree with your assessment.  Conceptually, this is a stemming  
problem.  By extension, it's not a tokenizing problem, and the  
behavior of the StandardTokenizer is fine.

The most recent Perl/CPAN version of the Snowball stemming library
added an option to strip leading l', d', etc. in French.  I know this
because, until that version, it didn't strip trailing 's in English
either, and I had to write some workarounds, just as you'll have to.
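
A minimal sketch of what such a workaround might look like, written as
plain Java so that it does not depend on any particular Lucene version.
The class name ElisionStripper, the method strip(), and the list of
elided prefixes are all assumptions made for illustration; none of them
comes from Lucene or Snowball.  The idea is to call something like this
on each term just before it is handed to the stemmer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene or Snowball.
public class ElisionStripper {

    // Assumed list of elided French articles and pronouns; adjust as needed.
    private static final Set<String> ELIDED_PREFIXES = new HashSet<>(
            Arrays.asList("l", "d", "j", "m", "t", "s", "n", "c", "qu"));

    // Returns the term without a leading elision such as l' or d'.
    // Terms whose prefix is not in the list are returned unchanged.
    public static String strip(String term) {
        // Treat the typographic apostrophe (U+2019) like the ASCII one.
        String normalized = term.replace('\u2019', '\'');
        int pos = normalized.indexOf('\'');
        // No apostrophe, or an apostrophe at either end: nothing to strip.
        if (pos <= 0 || pos == normalized.length() - 1) {
            return term;
        }
        String prefix = normalized.substring(0, pos).toLowerCase(Locale.ROOT);
        return ELIDED_PREFIXES.contains(prefix)
                ? normalized.substring(pos + 1)
                : term;
    }
}
```

With this sketch, strip("l'avion") gives "avion", while "O'Reilly" and
"aujourd'hui" come back untouched because their prefixes are not in the
assumed list.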

I'm curious: are there any cases in French where a string with an  
apostrophe in it ought to be split into two searchable tokens?  I  
know of no such cases in English: you never want to search for the ll  
in you'll, or the O in O'Reilly, etc.

Marvin Humphrey
Rectangular Research
