lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Konrad Scherer <bcdh...@uottawa.ca>
Subject Re: StandardFilter that works for French
Date Thu, 21 Nov 2002 21:25:29 GMT

>
>There are a number of contractions in English that could be affected if
>you're using the apostrophe as a marker, e.g.: isn't, wouldn't, I'd, he's,
>hasn't.  (Granted, these are often considered stop words.)  Thus, I think
>that your idea of incorporating this change into a French filter, rather
>than modifying Standard filter, is a good idea.

Sorry I forgot to mention that it only looks at words where the apostrophe 
occurs in the second letter and only for words that start with the six 
magic letters m,t,s,l,n,d . If filtering the very English specific 's and 
'S possessives is good enough for the StandardFilter then why not French as 
well?  In the comments of StandardTokenizer.jj we have "This should be a 
good tokenizer for most European-language documents". Most people will use 
this one, why not have it work as well as possible? The standard tokenizer 
is very english centric and the code I posted was for those who may not be 
aware of it. I work with a lot of bilingual documents (english and french) 
and my case, this filter improves the quality of the index.
More philosophically, there probably shouldn't even be a "standard" 
analyzer, just language specific ones.
All the best

Konrad


--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>


Mime
View raw message