lucene-dev mailing list archives

From Robert Muir
Subject Re: Arabic Analyzer: possible bug
Date Thu, 08 Oct 2009 14:11:33 GMT

Uwe, I might add to what you say. I do disagree a bit: I think mixed
English/Arabic text is pretty common (aside from the "product name" issue
you discussed).

This can get really complex for some informal text: you may have some
English, Arabic, and Arabic written in informal romanization, sometimes all
mixed together:


I'm not really sure how to make the default ArabicAnalyzer meet everyone's
needs. In this example it's going to screw up the romanized Arabic, because
such text uses numerals for some letters, and the analyzer uses something
based on CharTokenizer :) But allowing a word to, say, start with or contain
a numeral might not be the best thing for higher-quality text...
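
To make the trade-off concrete, here is a minimal sketch in plain Java. It
is not Lucene code: the class RomanizedTokenSketch and its methods are
hypothetical names, and the loop only imitates the general idea of a
character-class tokenizer (keep characters that pass a test, split on
everything else). The sample word "mar7aba" assumes the common chat
convention of writing the Arabic letter hah as "7".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT Lucene's CharTokenizer.
class RomanizedTokenSketch {

    // Decides whether a character belongs inside a token.
    interface CharTest {
        boolean isTokenChar(char c);
    }

    // Splits text into lowercased runs of characters accepted by the test.
    static List<String> tokenize(String text, CharTest test) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (test.isTokenChar(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Letter-only behaviour: digits act as token separators.
    static List<String> lettersOnly(String text) {
        return tokenize(text, c -> Character.isLetter(c));
    }

    // Alternative: digits may start or continue a token.
    static List<String> lettersOrDigits(String text) {
        return tokenize(text, c -> Character.isLetterOrDigit(c));
    }

    public static void main(String[] args) {
        // English, romanized Arabic, and Arabic script (as Unicode escapes).
        String informal = "Mar7aba from Lucene \u0645\u0631\u062D\u0628\u0627";
        System.out.println(lettersOnly(informal));      // "mar7aba" is split into "mar" and "aba"
        System.out.println(lettersOrDigits(informal));  // "mar7aba" survives as one token

        // The cost for cleaner text: bare numbers become tokens too.
        String formal = "Lucene 2.9 released in 2009";
        System.out.println(lettersOnly(formal));        // [lucene, released, in]
        System.out.println(lettersOrDigits(formal));    // [lucene, 2, 9, released, in, 2009]
    }
}
```

The letter-only test breaks the romanized word in two, while the
letter-or-digit test keeps it whole but also emits version numbers and
years as terms, which is the side effect worried about above.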

On Thu, Oct 8, 2009 at 9:56 AM, Uwe Schindler wrote:

> I think the idea of the lowercase filter in the Arabic analyzers is not
> really to index mixed-language texts. It is more for the case where you
> have some words within the Arabic content (like product names), which
> happens often. You see this often in Japanese texts as well. For these
> embedded English fragments you really need no stop word list. And if there
> is a stop word in them, for the target language it is not a real stop word;
> it may be additional information. Stop word removal is done mostly because
> stop words are needless (they appear in every text). But if you have one
> Arabic sentence where "the" also appears next to an English word, it is
> more important than all the "the" in this mail.
> Uwe
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen

Robert Muir
