lucene-dev mailing list archives

From Basem Narmok <>
Subject Re: Arabic Analyzer: possible bug
Date Thu, 08 Oct 2009 20:33:17 GMT

Yes, this will not work, as some digits are used to represent
(transliterate, if I may say) Arabic letters that have no close Latin
equivalent (e.g. 3 for Arabic Aeen, and 7 for Arabic H'a).
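
To make the problem concrete, here is a minimal sketch in plain Java. It
assumes a tokenizer that keeps only runs of letters, which is my reading of
the "something based on CharTokenizer" remark quoted below. The class and
method names are made up for illustration, and this is not the actual Lucene
code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical stand-in for a CharTokenizer-style tokenizer (not Lucene code).
class TokenSplitSketch {

    // Splits text into maximal runs of code points accepted by isTokenChar.
    static List<String> tokenize(String text, IntPredicate isTokenChar) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isTokenChar.test(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // A rejected character ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "lucene is 7elo";
        // Letters only: the leading digit of "7elo" is dropped.
        System.out.println(tokenize(text, Character::isLetter));        // [lucene, is, elo]
        // Letters or digits: the romanized word survives intact.
        System.out.println(tokenize(text, Character::isLetterOrDigit)); // [lucene, is, 7elo]
    }
}
```

With the letters-only rule the word is indexed as "elo", so a search for
"7elo" cannot match it. Accepting digits fixes that, at the cost of also
keeping tokens such as plain numbers.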

Some online services provide instant conversion of such
transliterations (e.g. try the word "7elo", which
means nice/cool in Arabic), so we could provide an analyzer stage that
converts such content to Arabic script :)
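
A very rough sketch of what the first step of such a stage might look like,
in plain Java. It only knows the two digit mappings mentioned above (3 and
7); the class name, the method names, and the "contains a Latin letter"
heuristic are all assumptions for illustration, and a real stage would also
need to convert the Latin letters:

```java
import java.util.Map;

// Hypothetical helper, not part of Lucene.
class ArabiziDigitSketch {

    // Only the two mappings named in this thread; a real table would be larger.
    static final Map<Character, Character> DIGIT_TO_ARABIC = Map.of(
        '3', '\u0639',  // ARABIC LETTER AIN
        '7', '\u062D'   // ARABIC LETTER HAH
    );

    // Assumed heuristic: a token is treated as romanized Arabic if it mixes
    // ASCII letters with at least one mapped digit.
    static boolean looksRomanized(String token) {
        boolean hasLatin = false;
        boolean hasMappedDigit = false;
        for (char c : token.toCharArray()) {
            if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
                hasLatin = true;
            }
            if (DIGIT_TO_ARABIC.containsKey(c)) {
                hasMappedDigit = true;
            }
        }
        return hasLatin && hasMappedDigit;
    }

    // Replaces mapped digits with Arabic letters; other characters are kept.
    static String mapDigits(String token) {
        if (!looksRomanized(token)) {
            return token;
        }
        StringBuilder sb = new StringBuilder();
        for (char c : token.toCharArray()) {
            char mapped = DIGIT_TO_ARABIC.getOrDefault(c, c);
            sb.append(mapped);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(mapDigits("7elo")); // Arabic hah followed by "elo"
        System.out.println(mapDigits("2009")); // unchanged: no Latin letters
    }
}
```

Note that this heuristic would also rewrite an English token like "mp3", so
it is not safe as a default; it only shows where the digit mapping would sit.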


On Thu, Oct 8, 2009 at 5:11 PM, Robert Muir <> wrote:
> Uwe, I might add to what you say. I do disagree a bit and think mixed
> English/Arabic text is pretty common (aside from the "product name" issue
> you discussed).
> This can get really complex for some informal text: you may have some
> English, Arabic, and Arabic written in informal romanization, sometimes all
> mixed together:
> Example:
> Not really sure how to make the default ArabicAnalyzer meet everyone's
> needs. In this example it's gonna screw up the romanized Arabic, because they
> use numerics for some letters, and it uses something based on CharTokenizer
> :) But allowing a word to, say, start with or contain a numeric might
> not be the best thing for higher-quality text...
> On Thu, Oct 8, 2009 at 9:56 AM, Uwe Schindler <> wrote:
>> I think the idea of the lowercase filter in the Arabic analyzers is not
>> really to index mixed-language texts. It is more for the case where you
>> have some words within the Arabic content (like product names, ...), which
>> happens often.
>> You see this often in Japanese texts too. And for these embedded English
>> fragments you really need no stop word list. And if there is a stop word
>> in one, it is not a real stop word for the target language; it may be
>> additional information. Stop word removal is done mostly because stop
>> words are needless (they appear in every text). But if you have one Arabic
>> sentence where "the" also appears next to an English word, it is more
>> important than all the "the"s in this mail.
>> Uwe
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
> --
> Robert Muir
