lucene-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Arabic Analyzer: possible bug
Date Fri, 09 Oct 2009 00:22:31 GMT
Basem, yeah, an analyzer that could somehow do something nice with this
transliterated Arabic chat would be a cool feature for forums and such in
the future.
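
A rough sketch of what the first step of such a stage might look like, in plain Python rather than as a Lucene filter. The only facts taken from this thread are the two digit conventions Basem mentions (3 for Aeen, 7 for H'a); the names `CHAT_DIGITS` and `map_chat_digits` are made up for illustration, and the table is deliberately incomplete. It does not transliterate the Latin letters at all, which is the hard part that services like Yamli handle.

```python
import string

# Hypothetical, incomplete table: only the two digits mentioned in this thread.
CHAT_DIGITS = {
    "3": "\u0639",  # ARABIC LETTER AIN ("Aeen")
    "7": "\u062d",  # ARABIC LETTER HAH ("H'a")
}


def map_chat_digits(token):
    """Replace chat-style digits with Arabic letters in one token.

    Assumption: a digit only stands for a letter when the token also
    contains Latin letters, so plain numbers such as "2009" are left alone.
    Latin letters themselves are not converted.
    """
    has_latin = any(ch in string.ascii_letters for ch in token)
    if not has_latin:
        return token
    return "".join(CHAT_DIGITS.get(ch, ch) for ch in token)


if __name__ == "__main__":
    for word in ["7elo", "2009", "hello"]:
        print(word, "->", map_chat_digits(word))
```

Applied to "7elo", this replaces only the leading 7 and leaves "elo" as it is, so a real stage would still need a proper romanized-to-Arabic conversion after it.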

On Thu, Oct 8, 2009 at 4:33 PM, Basem Narmok <narmok@gmail.com> wrote:

> Robert,
>
> Yes, this will not work, as some numbers are used to represent
> (transliterate, if I may say) some Arabic letters (e.g. 3 for Arabic
> Aeen, and 7 for Arabic H'a).
>
> Some online services provide instant conversion of such
> transliteration (e.g. http://www.yamli.com/; try the word "7elo", which
> means nice/cool in Arabic), so we could provide an analyzer stage that
> converts such content to Arabic :)
>
> Basem
>
> On Thu, Oct 8, 2009 at 5:11 PM, Robert Muir <rcmuir@gmail.com> wrote:
> > Uwe, I might add to what you say. I do disagree a bit and think mixed
> > English/Arabic text is pretty common (aside from the "product name" issue
> > you discussed).
> >
> > This can get really complex for some informal text: you may have some
> > English, Arabic, and Arabic written in informal romanization, sometimes
> > all mixed together:
> >
> > Example:
> > http://www.mahjoob.com/en/forums/showthread.php?t=211597&page=3
> >
> > Not really sure how to make the default ArabicAnalyzer meet everyone's
> > needs. In this example it's gonna screw up the romanized Arabic, because
> > they use numerics for some letters, and it uses something based on
> > CharTokenizer :) But allowing a word to, say, start with or contain a
> > numeric might not be the best thing for higher-quality text...
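
A minimal sketch of the tokenization problem described above, in plain Python rather than Lucene. The names `tokenize`, `letter_tokens` and `alnum_tokens` are made up for illustration, and `str.isalpha` only stands in for the kind of per-character letter test a CharTokenizer-based tokenizer applies; this is not Lucene's code.

```python
def tokenize(text, is_token_char):
    """Split text into lowercased runs of characters accepted by is_token_char."""
    tokens, current = [], []
    for ch in text:
        if is_token_char(ch):
            current.append(ch.lower())
        elif current:
            # A rejected character ends the current token.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def letter_tokens(text):
    """Letters only: digits act as separators and are dropped."""
    return tokenize(text, str.isalpha)


def alnum_tokens(text):
    """Letters and digits: a token may start with or contain a digit."""
    return tokenize(text, str.isalnum)


if __name__ == "__main__":
    sample = "7elo"  # the romanized word quoted elsewhere in this thread
    print(letter_tokens(sample))
    print(alnum_tokens(sample))
```

With the letters-only rule the leading 7 in "7elo" is thrown away and only "elo" is indexed. The digit-friendly rule keeps "7elo" intact, but it also keeps tokens such as "2009", which is the tradeoff for higher-quality text mentioned above.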
> >
> >
> > On Thu, Oct 8, 2009 at 9:56 AM, Uwe Schindler <uwe@thetaphi.de> wrote:
> >>
> >> I think the idea of the lowercase filter in the Arabic analyzers is not
> >> really to index mixed-language texts. It is more for the case where you
> >> have some words within the Arabic content (like product names), which
> >> happens often. You see this often in Japanese texts, too. And for these
> >> embedded English fragments you really need no stop word list. And if
> >> there is a stop word in it, for the target language it is not a real stop
> >> word; it may be additional information. Stop word removal is done mostly
> >> because the words are needless (they appear in every text). But if you
> >> have one Arabic sentence where "the" also appears next to an English
> >> word, it is more important than all the "the" in this mail.
> >>
> >>
> >> Uwe
> >>
> >> -----
> >> Uwe Schindler
> >> H.-H.-Meier-Allee 63, D-28213 Bremen
> >> http://www.thetaphi.de
> >> eMail: uwe@thetaphi.de
> >>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>
> >
> >
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
> >
>


-- 
Robert Muir
rcmuir@gmail.com
