lucene-java-user mailing list archives

From Endre Stølsvik <En...@Stolsvik.com>
Subject Re: Single Analyzer for multiple European languages
Date Tue, 27 Sep 2005 10:38:12 GMT
On Mon, 26 Sep 2005, Andrzej Bialecki wrote:

| Shashikant Kore wrote:
| 
| > Search:
| > - Get the superset of stopwords by merging the stopwords from all the
| > languages.
| 
| This step doesn't make sense. Stopwords ARE language specific. A stopword in
| one language may be a valid content word in another language - e.g. English
| stopwords "is, by, far" mean "ice, village, father" in Swedish. And vice
| versa, e.g. "den, men, man, sin, hans, era" are Swedish stopwords... So, if
| you mix them and apply to all documents then you will surely lose a lot of
| valid content.

Idea: Don't drop stopwords from the query. They aren't in the index for
their own language anyway, so they simply won't match there. However, you
might get hits from documents in other languages, where the same word is
an ordinary content word.
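
To make that concrete, here is a rough sketch in plain Java (no Lucene
classes, just a toy in-memory inverted index). It assumes each document's
language is known at index time and that stopwords are removed per language
only when indexing. The class name, the method names and the tiny stopword
lists are all made up for illustration; real lists are much longer.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class PerLanguageStopwords {
    // Hypothetical, heavily shortened stopword lists keyed by language code.
    static final Map<String, Set<String>> STOPWORDS = Map.of(
        "en", Set.of("the", "is", "by", "far"),
        "sv", Set.of("den", "men", "man", "sin"));

    // Toy inverted index: term -> ids of the documents that contain it.
    static final Map<String, Set<String>> index = new HashMap<>();

    // Index time: drop only the stopwords of the document's own language.
    static void addDocument(String id, String lang, String text) {
        Set<String> stop = STOPWORDS.getOrDefault(lang, Set.of());
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty() && !stop.contains(term)) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
    }

    // Query time: no stopword removal at all; any matching term is a hit (OR).
    static Set<String> search(String query) {
        Set<String> hits = new TreeSet<>();
        for (String term : query.toLowerCase(Locale.ROOT).split("\\s+")) {
            hits.addAll(index.getOrDefault(term, Set.of()));
        }
        return hits;
    }

    public static void main(String[] args) {
        addDocument("en-1", "en", "the village is far");
        addDocument("en-2", "en", "the old man");
        addDocument("sv-1", "sv", "far gillar is");
        addDocument("sv-2", "sv", "man ser den");

        System.out.println(search("is far"));  // only the Swedish document
        System.out.println(search("man"));     // only the English document
        System.out.println(search("village"));
    }
}
```

The query "is far" finds nothing in the English documents, because those
terms were never indexed for them, but it does hit the Swedish document
where "far" and "is" are content words. That is the "results from other
languages" effect.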

It's the stemming that's difficult. You'd have to stem the query once for
each language, and then search with each variant. But since you won't stem
at all, this wouldn't be much of a problem in your case.

Maybe a language-selector drop-down would be nice?

Regards,
Endre.
