In-Reply-To: <8f0ad1f30910081051s7792895eu92b267994315d3b3@mail.gmail.com>
References: <88769BA6-B709-42AB-97C4-F0A8C54FD339@gmail.com>
 <8f0ad1f30910080537q3a7c07dqa4c335ef1db04232@mail.gmail.com>
 <350203C8-BB8E-4C66-824A-3F2634800BDA@gmail.com>
 <66323efb0910080620i61a2a3d0n3dc1b34fedddbf6c@mail.gmail.com>
 <8f0ad1f30910080628p5f0f245cj73bd6e5a419cf4c3@mail.gmail.com>
 <66323efb0910080632r23796bf1me8be785c42caaa4b@mail.gmail.com>
 <8f0ad1f30910081051s7792895eu92b267994315d3b3@mail.gmail.com>
Date: Thu, 8 Oct 2009 23:23:05 +0300
Message-ID: <66323efb0910081323w69e4efe6q2e96df816c1c954a@mail.gmail.com>
Subject: Re: Arabic Analyzer: possible bug
From: Basem Narmok
To: java-dev@lucene.apache.org

Ok, the list is ready (initial one, as I will continue enhancing it).
I will create a JIRA issue and send the patch. Also, I have some small
changes to the normalization (e.g. removing some diacritics, and other
changes).

Best,

Basem

On Thu, Oct 8, 2009 at 8:51 PM, Robert Muir wrote:
> Basem, I really appreciate your time if you are able to do this.
>
> It's been my hope that introducing Arabic/Farsi support will create enough
> interest to encourage more qualified people to come and really make things
> nice.
>
> If you don't mind, you can look at
> http://wiki.apache.org/lucene-java/HowToContribute and create a JIRA Issue
> with a patch file to improve our stopwords list.
>
> Otherwise, in my opinion a good list is also acceptable and I will volunteer
> to turn it into a patch :)
>
> On Thu, Oct 8, 2009 at 9:32 AM, Basem Narmok wrote:
>>
>> Robert,
>>
>> I will be happy to do so. Currently, I am testing the new Arabic
>> analyzer in 2.9, and also I will prepare a new stop word list. I will
>> provide you with my findings/comments soon.
>>
>> Best,
>>
>> On Thu, Oct 8, 2009 at 4:28 PM, Robert Muir wrote:
>> > Basem, by any chance would you be willing to help improve it for us?
>> >
>> > On Thu, Oct 8, 2009 at 9:20 AM, Basem Narmok wrote:
>> >>
>> >> DM, there are no upper/lower cases in Arabic, so don't worry, but the
>> >> stop word list needs some corrections and may miss some common/stop
>> >> Arabic words.
>> >>
>> >> Best,
>> >>
>> >> On Thu, Oct 8, 2009 at 4:14 PM, DM Smith wrote:
>> >> > Robert,
>> >> > Thanks for the info.
>> >> > As I said, I am illiterate in Arabic. So I have another, perhaps
>> >> > nonsensical, question:
>> >> > Does the stop word list have every combination of upper/lower case for each
>> >> > Arabic word in the list? (i.e. is it fully de-normalized?) Or should it come
>> >> > after LowerCaseFilter?
>> >> > -- DM
>> >> > On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
>> >> >
>> >> > DM, this isn't a bug.
>> >> >
>> >> > The Arabic stopwords are not normalized.
>> >> >
>> >> > But for Persian, I normalized the stopwords, mostly because I did not want
>> >> > to have to create variations with Farsi yah versus Arabic yah for each one.
>> >> >
>> >> > On Thu, Oct 8, 2009 at 7:24 AM, DM Smith wrote:
>> >> >>
>> >> >> I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't know
>> >> >> Arabic or Farsi, but have some texts to index in those languages.)
>> >> >> The tokenizer/filter chain for ArabicAnalyzer is:
>> >> >>         TokenStream result = new ArabicLetterTokenizer( reader );
>> >> >>         result = new StopFilter( result, stoptable );
>> >> >>         result = new LowerCaseFilter(result);
>> >> >>         result = new ArabicNormalizationFilter( result );
>> >> >>         result = new ArabicStemFilter( result );
>> >> >>
>> >> >>         return result;
>> >> >>
>> >> >> Shouldn't the StopFilter come after ArabicNormalizationFilter?
>> >> >>
>> >> >> As a comparison the PersianAnalyzer has:
>> >> >>     TokenStream result = new ArabicLetterTokenizer(reader);
>> >> >>     result = new LowerCaseFilter(result);
>> >> >>     result = new ArabicNormalizationFilter(result);
>> >> >>     /* additional persian-specific normalization */
>> >> >>     result = new PersianNormalizationFilter(result);
>> >> >>     /*
>> >> >>      * the order here is important: the stopword list is normalized with the
>> >> >>      * above!
>> >> >>      */
>> >> >>     result = new StopFilter(result, stoptable);
>> >> >>
>> >> >>     return result;
>> >> >>
>> >> >>
>> >> >> Thanks,
>> >> >> DM
>> >> >
>> >> >
>> >> > --
>> >> > Robert Muir
>> >> > rcmuir@gmail.com
>> >> >
>> >> >
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >>
>> >
>> >
>> >
>> > --
>> > Robert Muir
>> > rcmuir@gmail.com
>> >
>>
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>
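
The ordering question in this thread comes down to one rule: the stop word list has to be in the same form as the tokens at the point where the stop filter runs. The sketch below illustrates that with plain Java collections rather than Lucene classes. Everything in it is an assumption made for illustration: the class name StopOrderSketch, the two method names, and the toy normalize() function (which only folds Farsi yeh into Arabic yeh and drops one diacritic) are hypothetical and are not the behavior of ArabicNormalizationFilter or PersianNormalizationFilter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopOrderSketch {

    // Toy normalizer (hypothetical, not Lucene's): folds Farsi yeh (U+06CC)
    // into Arabic yeh (U+064A) and drops the fatha diacritic (U+064E).
    static String normalize(String token) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c == '\u06CC') {
                sb.append('\u064A');
            } else if (c != '\u064E') {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Stop filtering first, then normalization: tokens are compared to the
    // stop list in their raw form, so the list must hold raw forms.
    static List<String> stopThenNormalize(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!stopWords.contains(t)) {
                out.add(normalize(t));
            }
        }
        return out;
    }

    // Normalization first, then stop filtering: tokens are compared to the
    // stop list after normalization, so the list must hold normalized forms.
    static List<String> normalizeThenStop(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            String n = normalize(t);
            if (!stopWords.contains(n)) {
                out.add(n);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // First token: feh + Farsi yeh (a variant spelling of a stop word).
        // Second token: an ordinary content word.
        List<String> tokens = Arrays.asList("\u0641\u06CC", "\u0643\u062A\u0627\u0628");
        // Stop list stored in normalized form: feh + Arabic yeh.
        Set<String> normalizedStops = new HashSet<>(Arrays.asList("\u0641\u064A"));

        System.out.println("stop then normalize keeps "
                + stopThenNormalize(tokens, normalizedStops).size() + " token(s)");
        System.out.println("normalize then stop keeps "
                + normalizeThenStop(tokens, normalizedStops).size() + " token(s)");
    }
}
```

With a normalized stop list, the variant spelling slips past the filter when stop filtering runs first, and is removed only when normalization runs first. An un-normalized list, like the Arabic one described in the thread, is the reverse case: it is written to match raw tokens, so it belongs before normalization.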