Subject: Re: arabic analyzer
From: walid
To: java-user@lucene.apache.org
Date: Sun, 02 Aug 2009 03:17:19 +0300
Message-Id: <1249172239.5261.4.camel@mine-lenovo>

I guess in that case, my users will be angry :)
The fact is, plural (as an example) is not supported, and that is one of
the most common things that a person doing a search expects not to have
to worry about.

Anyway, I will roll it out and see the users' reaction :)

Thank you.

-walid

On Fri, 2009-07-24 at 08:39 -0400, Robert Muir wrote:
> walid, it is true some of what you mentioned (from aramorph) works in
> the light stemming version, some does not.
> The problem is that it's not clear to me that what aramorph is doing is
> really the best.
>
> From the paper I sent you:
>
> The best stemmer in our experiments, light8-s was very simple and did
> not try to find roots or take into account most of Arabic morphology.
> It is probably not essential for the stemmer to yield the correct
> forms, whether stems or roots.
> It is sufficient for it to group most of the forms that belong together.
>
> This is what is being used in Lucene, light8-s.
> If you read section 5.2.1 of the paper, you will see this method
> outperforms the morphological analysis method you speak of (using the
> same Buckwalter dictionary).
>
> But I also understand this is just a general text IR relevance
> measurement (your specific text might vary), and it does not take into
> account some human factors (it can be better on average, but make
> users angry, that type of thing).
>
> Another problem I have with this situation is that I'm not sure the
> morph. analysis method is really wrong, just that perhaps aramorph /
> that paper might be indexing the wrong thing. For example, aramorph
> indexes Arabic stems, but the latest Buckwalter dictionary has
> lemmaID, why not index that?
>
> Anyway, I hope in the future there will be more options, that would be
> a good thing!
>
> On Fri, Jul 24, 2009 at 4:06 AM, walid wrote:
> > We were using the aramorph library for some time and so we mapped out
> > the set of features it provides. They are as follows:
> >
> > ----------------------------------------------------------------------
> > The ء and ~ are considered unique characters.
> >
> >   * أ, آ, ا, and إ are distinct
> >   * و and ؤ are distinct
> >   * ى and ئ are distinct
> >
> > * The ا and ة (denoting the feminine adjective) at the end of a
> >   word are optional.
> >
> > * The ال, ب, ل, ك, بال, كال, لل at the beginning of a word are
> >   optional.
> >
> > * All حركات as well as the ّ (شدّة) are ignored.
> >
> > * The ي, و, ات, ون denoting the plural form of a word are
> >   optional. If the indexed word ends with a ة, its plural, which
> >   replaces the ة with ات, is recognized.
> >
> > The following examples illustrate these rules:
> >
> > Indexed Word    Search Term      Success
> > ------------    -----------      -------
> > الحياة
> >                 للحياة           True
> >                 حياة             True
> >                 حيا              False
> >                 ألحياة           False
> >                 إلحياة           False
> >                 كالحياة          True
> >                 بالحياة          True
> >                 بحياة            True
> >                 لحياة            True
> > دولارا
> >                 دولار            True
> >                 بدولار           True
> >                 بالدولار         True
> >                 الدولار          True
> >                 دؤلارا           False
> >                 دولأرا           False
> >                 دولارأ           False
> > الكاتب
> >                 كاتب             True
> >                 لكاتب            True
> >                 كاتبة            True
> >                 الكاتبة          True
> >                 الكاتبات         True
> >                 كاتبون           True
> >                 كاتبو \ كاتبي    True
> >                 كتب              False
> > جميلة
> >                 جميلات           True
> >                 جميل             True
> >                 الجمال           False
> > بنت
> >                 ابنة             False
> >                 بن               True
> >                 ابن              True
> >                 ابنت             False
> > ----------------------------------------------------------------------
> >
> > while with the new one, we only got matches for:
> >   فّ فُ فٌ فف فِِ فٍ ف
> > and the likes of that.
> >
> > -walid
> >
> > On Thu, 2009-07-23 at 09:33 -0400, Robert Muir wrote:
> >
> >> walid, can you provide any more information other than "very poor result"?
> >>
> >> Others have not measured much difference between morphological
> >> analysis and light stemming:
> >> http://ciir.cs.umass.edu/pubfiles/ir-249.pdf
> >>
> >> On Thu, Jul 23, 2009 at 7:34 AM, walid wrote:
> >> > http://issues.apache.org/jira/browse/LUCENE-1406
> >> > http://issues.apache.org/jira/browse/LUCENE-153
> >> >
> >> > based on this, there are two options:
> >> > 1- using the aramorph library
> >> > 2- moving the code from trunk to the current release and using the
> >> >    provided arabic analyzer
> >> >
> >> > 1- the library works very well in indexing, tokenizing, stemming and
> >> >    everything, but causes memory leaks
> >> > 2- the provided library has a very poor result compared to the aramorph
> >> >    library.
> >> >
> >> > Is there a plan to have better arabic support with full morphological
> >> > analysis support?
> >> >
> >> > walid
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> > For additional commands, e-mail: java-user-help@lucene.apache.org
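
The paper excerpt quoted in the thread says it is enough for a stemmer to "group most of the forms that belong together". A minimal sketch of that idea follows. It is an illustration only: the prefix and suffix lists, the MIN_STEM guard, and the function names are assumptions made for this example, not the light8 definition from the paper and not Lucene's analyzer code. It also folds the alef variants together, which differs from the aramorph behaviour listed in the thread, where they stay distinct.

```python
import re

# Assumed, simplified rule lists for illustration only. They are NOT copied
# from Lucene's Arabic analyzer or from the light8 definition in the paper.
DIACRITICS = re.compile("[\u064B-\u0652]")       # fathatan .. sukun, includes shadda
ALEF_FORMS = re.compile("[\u0622\u0623\u0625]")  # alef with madda / hamza above / hamza below
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل")  # longer prefixes are tried first
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي")
MIN_STEM = 3  # hypothetical guard: never strip a word below three letters


def normalize(word):
    """Drop diacritics and tatweel, then fold alef variants and alef maqsura."""
    word = DIACRITICS.sub("", word)
    word = word.replace("\u0640", "")         # tatweel (kashida)
    word = ALEF_FORMS.sub("\u0627", word)     # fold to bare alef
    return word.replace("\u0649", "\u064A")   # alef maqsura -> yeh


def light_stem(word):
    """Strip at most one prefix and one suffix; no dictionary, no root finding."""
    word = normalize(word)
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            word = word[:-len(suffix)]
            break
    return word


def group_forms(words):
    """Bucket surface forms by their light stem."""
    groups = {}
    for w in words:
        groups.setdefault(light_stem(w), []).append(w)
    return groups


# Search terms taken from the examples table in the thread.
forms = ["الكاتب", "كاتب", "كاتبة", "الكاتبات", "كاتبون", "كتب"]
for stem, members in group_forms(forms).items():
    print(stem, "<-", " ".join(members))
```

With these assumed lists, the first five forms fall into one group and كتب stays on its own. That also shows the limit of affix stripping: an irregular (broken) plural such as كتب for كتاب is not conflated by rules of this kind.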