lucene-dev mailing list archives

From Basem Narmok <nar...@gmail.com>
Subject Re: Arabic Analyzer: possible bug
Date Thu, 08 Oct 2009 20:33:17 GMT
Robert,

Yes, this will not work, as some digits are used as letters when Arabic
is written in Latin script (transliterated, if I may say), e.g. 3 for
Arabic Aeen and 7 for Arabic H'a.
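
To make the problem concrete, here is a rough standalone sketch (plain
Java, not the actual Lucene code; the class and method names are made up
for illustration). It assumes a tokenizer that decides per character
whether it belongs to a token, in the spirit of CharTokenizer, and
compares a letters-only rule with one that also accepts digits:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not Lucene's ArabicAnalyzer.
final class DigitTokenSketch {

    /**
     * Splits text into lowercased tokens. A character belongs to a token
     * if it is a letter, or a digit when keepDigits is true.
     */
    static List<String> tokenize(String text, boolean keepDigits) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean tokenChar = Character.isLetter(c)
                    || (keepDigits && Character.isDigit(c));
            if (tokenChar) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-token character ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("7elo", false)); // [elo]
        System.out.println(tokenize("7elo", true));  // [7elo]
    }
}
```

With the letters-only rule the leading "7" is dropped and the romanized
word loses its first letter; accepting digits keeps the word whole, but
it also lets plain numbers through as tokens.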

Some online services provide instant conversion of such
transliterations (e.g. http://www.yamli.com/; try the word "7elo", which
means nice/cool in Arabic), so we could provide an analyzer stage that
converts such content to Arabic script :)
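
As a very rough idea of what the first step of such a stage might look
like, the sketch below maps only the two digits mentioned above (3 and
7) to their Arabic letters. Everything here is an assumption for
illustration: the class and method names are invented, the rule "only
touch tokens that also contain a Latin letter" is my own guess at how to
leave real numbers alone, and a real converter would also have to handle
the Latin letters, which this does not attempt:

```java
// Hypothetical sketch; not an existing Lucene filter.
final class ArabiziDigitSketch {

    /** True if the token contains at least one ASCII Latin letter. */
    static boolean hasLatinLetter(String token) {
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
                return true;
            }
        }
        return false;
    }

    /**
     * Replaces 3 with ARABIC LETTER AIN and 7 with ARABIC LETTER HAH,
     * but only in tokens that also contain a Latin letter, so that
     * ordinary numbers such as "2009" or "37" are left unchanged.
     */
    static String mapDigits(String token) {
        if (!hasLatinLetter(token)) {
            return token;
        }
        StringBuilder out = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            switch (c) {
                case '3': out.append('\u0639'); break; // Ain
                case '7': out.append('\u062D'); break; // Hah
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(mapDigits("7elo"));
        System.out.println(mapDigits("37"));
    }
}
```

This only gets the digits out of the way; turning the remaining Latin
letters into Arabic script is the hard part that services like the one
above take care of.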

Basem

On Thu, Oct 8, 2009 at 5:11 PM, Robert Muir <rcmuir@gmail.com> wrote:
> Uwe, I might add to what you say. I do disagree a bit and think mixed
> english/arabic text is pretty common (aside from the "product name" issue
> you discussed).
>
> this can get really complex for some informal text: you have maybe some
> english, arabic, and arabic written in informal romanization, sometimes all
> mixed together:
>
> Example:
> http://www.mahjoob.com/en/forums/showthread.php?t=211597&page=3
>
> I'm not really sure how to make the default ArabicAnalyzer meet everyone's
> needs. In this example it's going to screw up the romanized Arabic, because
> they use numerals for some letters, and it uses something based on
> CharTokenizer :) But allowing a word to, say, start with or contain a
> numeral might not be the best thing for higher-quality text...
>
>
> On Thu, Oct 8, 2009 at 9:56 AM, Uwe Schindler <uwe@thetaphi.de> wrote:
>>
>> I think the idea of the lowercase filter in the Arabic analyzers is not
>> really to index mixed-language texts. It is more for the case where you
>> have some words within the Arabic content (like product names), which
>> happens often. You often see this in Japanese texts as well. And for these
>> embedded English fragments you really need no stop word list. And if there
>> is a stop word in it, for the target language it is not a real stop word;
>> it may be additional information. Stop word removal is done mostly because
>> stop words are needless (they appear in every text). But if you have one
>> Arabic sentence where "the" also appears next to an English word, it is
>> more important than all the "the" in this mail.
>>
>>
>> Uwe
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: uwe@thetaphi.de
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>
