lucene-java-user mailing list archives

From Konrad Scherer <bcdh...@uottawa.ca>
Subject StandardFilter that works for French
Date Thu, 21 Nov 2002 20:12:26 GMT
Hello all,

I am using Lucene to index both English and French documents and have run 
into some problems with the analysis of the text. The project I am working 
with is using the searches to do language analysis so this may not be 
relevant to some people. Here is a quick explanation.

In French, six words (me, te, se, le/la, ne, de) have their final vowel 
replaced with an apostrophe when the following word starts with a vowel. 
For example, "me aider" becomes "m'aider". Currently Lucene indexes m'aider, 
s'aider and n'aider as different words when in fact they should be analyzed as 
me aider, se aider, ne aider, etc. So I modified StandardFilter to send 
back each of these words as two tokens, which meant adding a one-Token buffer. 
I toyed with modifying StandardTokenizer.jj but I was worried about unintended 
changes in behavior.
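
To make the idea concrete, here is a rough sketch of the splitting logic with a one-token buffer. It is plain Java over strings, not the actual patch and not the Lucene TokenFilter API: the class name ElisionSplitter and the helper splitAll are made up for illustration, and l' is assumed to always expand to "le" even though it may stand for "la".

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for the modified filter; does not use Lucene classes. */
public class ElisionSplitter implements Iterator<String> {
    // First letters of the six elidable words: me, te, se, le/la, ne, de.
    private static final String PREFIXES = "mtslnd";

    private final Iterator<String> input;
    // The one-token buffer: holds the second half of a token that was split.
    private String buffered;

    public ElisionSplitter(Iterator<String> input) {
        this.input = input;
    }

    @Override
    public boolean hasNext() {
        return buffered != null || input.hasNext();
    }

    @Override
    public String next() {
        // Emit the buffered remainder before pulling a new token.
        if (buffered != null) {
            String t = buffered;
            buffered = null;
            return t;
        }
        String token = input.next();
        if (token.length() > 2 && token.charAt(1) == '\'') {
            char first = Character.toLowerCase(token.charAt(0));
            if (PREFIXES.indexOf(first) >= 0) {
                buffered = token.substring(2); // e.g. "aider"
                return first + "e";            // e.g. "me" (assumption: l' becomes "le")
            }
        }
        return token;
    }

    /** Convenience helper (hypothetical): run the splitter over a whole list. */
    public static List<String> splitAll(List<String> tokens) {
        List<String> out = new ArrayList<>();
        Iterator<String> it = new ElisionSplitter(tokens.iterator());
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }
}
```

In this sketch, tokens that do not start with one of the six letters followed by an apostrophe pass through unchanged, so English possessives such as "John's" are not touched.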

This change should not affect English indexing. The only difference I can think 
of is that a word like m'lord would be indexed as "me lord". Still, it might 
be better to make a French package and add this to a French filter.

I hope this is useful to anyone working with French.
All the best.

Konrad