lucene-java-user mailing list archives

From "Joshua O'Madadhain"
Subject Re: StandardFilter that works for French
Date Thu, 21 Nov 2002 20:59:35 GMT
On Thu, 21 Nov 2002, Konrad Scherer wrote:

> In French you have 6 words (me, te, se, le/la, ne, de) where the e is
> replaced with an apostrophe when the following word starts with a vowel.
> For example, me aider becomes m'aider. Currently Lucene indexes m'aider,
> s'aider, n'aider as different words when in fact they should be analyzed as
> me aider, se aider, ne aider, etc. So I modified StandardFilter to send
> back these words as two words. I had to add a one-Token buffer. I toyed
> with modifying StandardTokenizer.jj but I was worried about unintended
> changes in behavior.
> This change will not affect English indexing. The only change I can think
> of is that a word like m'lord would be indexed as "me lord". Still, it might
> be better to make a French package and add this to a French Filter.

There are a number of contractions in English that could be affected if
you're using the apostrophe as a marker, e.g.: isn't, wouldn't, I'd, he's,
hasn't.  (Granted, these are often considered stop words.)  Thus, I agree
with your suggestion of putting this change into a French filter rather
than modifying StandardFilter.
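
To make the trade-off concrete, here is a minimal sketch of the expansion
described in the quoted message. It works on plain strings rather than on
Lucene's Token/TokenFilter classes, and the names ElisionExpander and expand
are hypothetical, not anything in Lucene. It also assumes a straight ASCII
apostrophe and always expands l' to "le", although the original word may
have been "la".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Lucene.
class ElisionExpander {
    // First letters of the six words from the quoted message:
    // me, te, se, le/la, ne, de.
    private static final Set<String> ELIDED =
        new HashSet<>(Arrays.asList("m", "t", "s", "l", "n", "d"));

    // Returns two tokens for an elided form such as "m'aider",
    // or the original token unchanged otherwise.
    static List<String> expand(String token) {
        List<String> out = new ArrayList<>();
        int apos = token.indexOf('\'');
        // Only a single letter before the apostrophe, with text after it.
        if (apos == 1 && token.length() > 2) {
            String prefix = token.substring(0, 1).toLowerCase(Locale.ROOT);
            if (ELIDED.contains(prefix)) {
                out.add(prefix + "e");          // assumption: l' always becomes "le"
                out.add(token.substring(2));    // the word after the apostrophe
                return out;
            }
        }
        out.add(token);
        return out;
    }
}
```

With this particular rule, ElisionExpander.expand("m'aider") yields "me" and
"aider", while isn't, I'd and he's pass through unchanged because none of
them is one of the six letters followed by an apostrophe. English forms such
as m'lord or d'oh would still be split, which is one more reason to keep the
logic in a French-specific filter.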

   Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for.  -- Bill Watterson
 My opinions are too rational and insightful to be those of any organization.
