Date: Thu, 8 Oct 2009 23:23:05 +0300
Message-ID: <66323efb0910081323w69e4efe6q2e96df816c1c954a@mail.gmail.com>
Subject: Re: Arabic Analyzer: possible bug
From: Basem Narmok <narmok@gmail.com>
To: java-dev@lucene.apache.org

Ok, the list is ready (an initial one, as I will continue enhancing it).
I will create a JIRA issue and send the patch.

Also, I have some small changes to the normalization (e.g. removing
some diacritics, and other changes).
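
As a rough illustration of what "removing some diacritics" could mean (this
is not the patch itself), here is a minimal sketch that drops the Arabic
short-vowel marks in the range U+064B to U+0652. The class and method names
are hypothetical and do not come from Lucene, and the exact set of marks to
remove is an assumption.

```java
final class DiacriticSketch {
    // Hypothetical helper, not part of Lucene: drops the Arabic marks
    // U+064B (FATHATAN) through U+0652 (SUKUN) from a term.
    static String stripDiacritics(String term) {
        StringBuilder out = new StringBuilder(term.length());
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c < '\u064B' || c > '\u0652') {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // kaf, kasra, teh, fatha, alef, beh: six chars, two of them marks
        String vocalized = "\u0643\u0650\u062A\u064E\u0627\u0628";
        String bare = stripDiacritics(vocalized);
        System.out.println(bare.length()); // prints 4
    }
}
```

Whatever set of marks is chosen, the stop word list has to be kept in step
with it if stop filtering runs after normalization.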

Best,
Basem

On Thu, Oct 8, 2009 at 8:51 PM, Robert Muir <rcmuir@gmail.com> wrote:
> Basem, I really appreciate your time if you are able to do this.
>
> It's been my hope that introducing Arabic/Farsi support will create enough
> interest to encourage more qualified people to come and really make things
> nice.
>
> If you don't mind, you can look at
> http://wiki.apache.org/lucene-java/HowToContribute and create a JIRA Issue
> with a patch file to improve our stopwords list.
>
> Otherwise, in my opinion a good list is also acceptable and I will volunteer
> to turn it into a patch :)
>
> On Thu, Oct 8, 2009 at 9:32 AM, Basem Narmok <narmok@gmail.com> wrote:
>>
>> Robert,
>>
>> I will be happy to do so. Currently, I am testing the new Arabic
>> analyzer in 2.9, and also I will prepare a new stop word list. I will
>> provide you with my findings/comments soon.
>>
>> Best,
>>
>> On Thu, Oct 8, 2009 at 4:28 PM, Robert Muir <rcmuir@gmail.com> wrote:
>> > Basem, by any chance would you be willing to help improve it for us?
>> >
>> > On Thu, Oct 8, 2009 at 9:20 AM, Basem Narmok <narmok@gmail.com> wrote:
>> >>
>> >> DM, there are no upper/lower cases in Arabic, so don't worry, but the
>> >> stop word list needs some corrections and may miss some common/stop
>> >> Arabic words.
>> >>
>> >> Best,
>> >>
>> >> On Thu, Oct 8, 2009 at 4:14 PM, DM Smith <dmsmith555@gmail.com> wrote:
>> >> > Robert,
>> >> > Thanks for the info.
>> >> > As I said, I am illiterate in Arabic. So I have another, perhaps
>> >> > nonsensical, question:
>> >> > Does the stop word list have every combination of upper/lower case for
>> >> > each Arabic word in the list? (i.e. is it fully de-normalized?) Or should
>> >> > it come after LowerCaseFilter?
>> >> > -- DM
>> >> > On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
>> >> >
>> >> > DM, this isn't a bug.
>> >> >
>> >> > The Arabic stopwords are not normalized.
>> >> >
>> >> > But for Persian, I normalized the stopwords, mostly because I did not
>> >> > want to have to create variations with Farsi yah versus Arabic yah for
>> >> > each one.
>> >> >
>> >> > On Thu, Oct 8, 2009 at 7:24 AM, DM Smith <dmsmith555@gmail.com> wrote:
>> >> >>
>> >> >> I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't know
>> >> >> Arabic or Farsi, but have some texts to index in those languages.)
>> >> >> The tokenizer/filter chain for ArabicAnalyzer is:
>> >> >>         TokenStream result = new ArabicLetterTokenizer( reader );
>> >> >>         result = new StopFilter( result, stoptable );
>> >> >>         result = new LowerCaseFilter(result);
>> >> >>         result = new ArabicNormalizationFilter( result );
>> >> >>         result = new ArabicStemFilter( result );
>> >> >>
>> >> >>         return result;
>> >> >>
>> >> >> Shouldn't the StopFilter come after ArabicNormalizationFilter?
>> >> >>
>> >> >> As a comparison the PersianAnalyzer has:
>> >> >>     TokenStream result = new ArabicLetterTokenizer(reader);
>> >> >>     result = new LowerCaseFilter(result);
>> >> >>     result = new ArabicNormalizationFilter(result);
>> >> >>     /* additional persian-specific normalization */
>> >> >>     result = new PersianNormalizationFilter(result);
>> >> >>     /*
>> >> >>      * the order here is important: the stopword list is normalized with the
>> >> >>      * above!
>> >> >>      */
>> >> >>     result = new StopFilter(result, stoptable);
>> >> >>
>> >> >>     return result;
>> >> >>
>> >> >>
>> >> >> Thanks,
>> >> >> DM
>> >> >
>> >> >
>> >> > --
>> >> > Robert Muir
>> >> > rcmuir@gmail.com
>> >> >
>> >> >
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >>
>> >
>> >
>> >
>> > --
>> > Robert Muir
>> > rcmuir@gmail.com
>> >
>>
>>
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>
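
To make the ordering question from the quoted thread concrete, here is a
minimal sketch that does not use Lucene at all. It assumes a toy
normalization step that only folds FARSI YEH (U+06CC) into ARABIC YEH
(U+064A), and a stop word list stored in that normalized spelling. All
class, method, and variable names are hypothetical. It shows why a
normalized stop word list only works if stop filtering runs after
normalization, while an un-normalized list (as described for the Arabic
analyzer) is matched against the raw tokens instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class StopOrderSketch {
    // Simplified stand-in for a normalization filter: folds FARSI YEH
    // (U+06CC) into ARABIC YEH (U+064A). Real filters do more than this.
    static String normalize(String term) {
        return term.replace('\u06CC', '\u064A');
    }

    // Stop filtering on the raw token first, then normalization.
    static List<String> stopThenNormalize(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!stopWords.contains(t)) {
                out.add(normalize(t));
            }
        }
        return out;
    }

    // Normalization first, then stop filtering on the normalized token.
    static List<String> normalizeThenStop(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            String n = normalize(t);
            if (!stopWords.contains(n)) {
                out.add(n);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // One stop word, stored only in its normalized spelling (ARABIC YEH).
        String stopNormalized = "\u0641\u064A";   // feh + Arabic yeh
        String sameWordFarsiYeh = "\u0641\u06CC"; // feh + Farsi yeh
        Set<String> stopWords = new HashSet<>(Arrays.asList(stopNormalized));
        List<String> tokens = Arrays.asList(sameWordFarsiYeh, "\u0643\u062A\u0627\u0628");

        // Stop word slips through when filtering happens before normalization.
        System.out.println(stopThenNormalize(tokens, stopWords).size()); // prints 2
        // Stop word is removed when filtering happens after normalization.
        System.out.println(normalizeThenStop(tokens, stopWords).size()); // prints 1
    }
}
```

Either order can be correct; what matters is that the stop word list is
written in the same form (raw or normalized) as the tokens it is compared
against.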