Hello all,
I am using Lucene to index both English and French documents and have run
into some problems with the analysis of the text. The project I am working
on uses the searches to do language analysis, so this may not be relevant
to everyone. Here is a quick explanation.
In French, short words such as me, te, se, le/la, ne and de drop their
final vowel and take an apostrophe when the following word starts with a
vowel. For example, me aider becomes m'aider. Currently Lucene indexes
m'aider, s'aider and n'aider as different words, when in fact they should
be analyzed as me aider, se aider, ne aider, etc. So I modified
StandardFilter to send back these words as two tokens. To do that I had to
add a one-token buffer. I toyed with modifying StandardTokenizer.jj, but I
was worried about unintended changes in behavior.
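
For anyone who wants to see the idea without reading the patch, here is a
minimal sketch of the one-token buffer. It is not the actual StandardFilter
change: it works on plain strings so it runs without Lucene, and the class
name ElisionSplitter, the split helper and the expansion table are all
made up for illustration. It also assumes a plain ASCII apostrophe and
always expands l' to "le", which is an arbitrary choice since l' can stand
for either le or la.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a token filter; it wraps an iterator of
// plain string tokens instead of Lucene's Token/TokenStream classes.
class ElisionSplitter implements Iterator<String> {
    // Elided letter -> full word. 'l' is ambiguous (le or la); "le" is
    // an arbitrary choice made for this sketch.
    private static final Map<Character, String> EXPANSIONS = Map.of(
        'm', "me", 't', "te", 's', "se",
        'l', "le", 'n', "ne", 'd', "de");

    private final Iterator<String> input;
    private String buffered; // the one-token buffer

    ElisionSplitter(Iterator<String> input) {
        this.input = input;
    }

    @Override
    public boolean hasNext() {
        return buffered != null || input.hasNext();
    }

    @Override
    public String next() {
        // Hand out the buffered second half before reading more input.
        if (buffered != null) {
            String token = buffered;
            buffered = null;
            return token;
        }
        String token = input.next();
        // Split only "x'rest" where x is a known elided letter and rest
        // is non-empty; everything else passes through unchanged.
        if (token.length() > 2 && token.charAt(1) == '\'') {
            String full = EXPANSIONS.get(Character.toLowerCase(token.charAt(0)));
            if (full != null) {
                buffered = token.substring(2);
                return full;
            }
        }
        return token;
    }

    // Convenience helper: run the splitter over a whole token list.
    static List<String> split(List<String> tokens) {
        List<String> out = new ArrayList<>();
        new ElisionSplitter(tokens.iterator()).forEachRemaining(out::add);
        return out;
    }
}
```

Calling ElisionSplitter.split on the tokens "m'aider" and "don't" gives
back "me", "aider" and "don't". A real filter would also have to decide
what offsets and positions the two new tokens get, which this sketch
ignores.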
This change should not affect English indexing. The only difference I can
think of is that a word like m'lord would be indexed as "me lord". Still,
it might be better to make a French package and add this to a French
filter.
I hope this is useful to anyone working with French.
All the best.
Konrad