lucene-java-user mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: Single Analyzer for multiple European languages
Date Mon, 26 Sep 2005 19:08:47 GMT
Shashikant Kore wrote:

> Search:
> - Get the superset of stopwords by merging the stopwords from all the languages.

This step doesn't make sense. Stopwords ARE language-specific. A 
stopword in one language may be a valid content word in another 
language, e.g. the English stopwords "is, by, far" mean "ice, village, 
father" in Swedish. And vice versa: "den, men, man, sin, hans, era" are 
Swedish stopwords, most of which are valid content words in English. So 
if you merge the lists and apply the result to every document, you will 
surely lose a lot of valid content.
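
To make the effect concrete, here is a minimal sketch in plain Java 
with no Lucene dependency. The class name StopwordDemo, the filter and 
merged helpers, and the sample sentence are assumptions made up for 
illustration, and the two lists hold only the handful of words quoted 
above, not real stopword lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopwordDemo {
    // Illustrative mini-lists built from the words quoted in this message.
    static final Set<String> ENGLISH =
            new HashSet<>(Arrays.asList("is", "by", "far"));
    static final Set<String> SWEDISH =
            new HashSet<>(Arrays.asList("den", "men", "man", "sin", "hans", "era"));

    // Union of both lists, i.e. the "superset" approach being criticised.
    static Set<String> merged() {
        Set<String> all = new HashSet<>(ENGLISH);
        all.addAll(SWEDISH);
        return all;
    }

    // Lowercase, split on whitespace, and drop every token found in stopwords.
    static List<String> filter(String text, Set<String> stopwords) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty() && !stopwords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Swedish for "my father lives in a village".
        String swedish = "min far bor i en by";

        // Swedish list only: "far" and "by" survive as content words.
        System.out.println(filter(swedish, SWEDISH));

        // Merged list: "far" and "by" are dropped as if they were English.
        System.out.println(filter(swedish, merged()));
    }
}
```

With the Swedish-only list all six tokens are kept; with the merged 
list "far" and "by" disappear, so a query for "father" or "village" in 
Swedish can no longer match that document.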

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



