lucene-dev mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Arabic Analyzer: possible bug
Date Thu, 08 Oct 2009 13:56:51 GMT
I think the point of the lowercase filter in the Arabic analyzers is not
really to index mixed-language texts. It is there for the case where a few
foreign words (product names, etc.) are embedded in the Arabic content,
which happens often. You often see this in Japanese texts, too. For these
embedded English fragments you really do not need a stop word list. And if
a fragment does contain a stop word, it is not a real stop word for the
target language; it may carry additional information. Stop words are
removed mostly because they are needless (they appear in every text). But
if an Arabic sentence contains "the" next to an English word, that one
occurrence is more important than all the "the"s in this mail.
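
As a rough illustration of that behaviour, here is a small self-contained
Java sketch. It is not Lucene's ArabicAnalyzer and uses no Lucene classes;
the class name MixedScriptSketch, the method analyze and the three-word stop
list are made up for this example. It splits on whitespace, lowercases every
token, and removes only Arabic stop words, so an embedded English "the" is
kept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class MixedScriptSketch {
    // Hypothetical, tiny Arabic stop list, written as escapes so the
    // source file stays plain ASCII. Not the list Lucene ships.
    private static final Set<String> ARABIC_STOP_WORDS = Set.of(
        "\u0645\u0646",        // "min" (from)
        "\u0641\u064A",        // "fi" (in)
        "\u0639\u0644\u0649"   // "ala" (on)
    );

    // Whitespace tokenization, then lowercasing, then Arabic-only stop
    // word removal. There is deliberately no English stop list.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // happens only for empty or all-whitespace input
            }
            // Lowercasing changes the Latin tokens; Arabic has no case.
            String token = raw.toLowerCase(Locale.ROOT);
            if (ARABIC_STOP_WORDS.contains(token)) {
                continue; // drop the Arabic stop word
            }
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Latin words around the Arabic stop word "min".
        // prints: [the, iphone, apple]
        System.out.println(analyze("The iPhone \u0645\u0646 Apple"));
    }
}
```

The lowercasing step is what lets a query for "iphone" match "iPhone" in the
text, and leaving English stop words alone keeps the rare embedded "the"
searchable.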


Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


