lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From walid <walid.bak...@elementn.com>
Subject Re: arabic analyzer
Date Sun, 02 Aug 2009 00:17:19 GMT
I guess in that case, my users will be angry :)
the fact is, plural (as an example) is not supported, and that is one of
the most common things that a person doing some search will expect to
not have to worry about.
anyway, will roll it out and see the users' reaction :)

thank you.

-walid 

On Fri, 2009-07-24 at 08:39 -0400, Robert Muir wrote:
> walid, it is true some of what you mentioned (from aramorph) works in
> light stemming version, some does not.
> The problem is that its not clear to me that what aramorph is doing is
> really the best.
> 
> >From the paper I sent you:
> 
> The best stemmer in our experiments, light8-s was very simple and did
> not try to find roots or take into account most of Arabic morphology.
> It is probably not essential for the stemmer to yield the correct
> forms, whether stems or roots.
> It is sufficient for it to group most of the forms that belong together.
> 
> This is what is being used in lucene, light8-s. If you read section
> 5.2.1 of the paper, you will see this method outperforms the
> morphological analysis method you speak of (using the same buckwalter
> dictionary)
> 
> But I also understand this is just a general text IR relevance
> measurement (your specific text might vary), and it does not take into
> account some human factors (it can be better on average, but make
> users angry, that type of thing).
> 
> Another problem I have with this situation is that I'm not sure the
> morph. analysis method is really wrong, just that perhaps aramorph /
> that paper might be indexing the wrong thing. For example, aramorph
> indexes arabic stems, but the latest buckwalter dictionary has
> lemmaID, why not index that?
> 
> anyway, I hope in the future there will be more options, that would be
> a good thing!
> 
> On Fri, Jul 24, 2009 at 4:06 AM, walid<walid.bakkar@elementn.com> wrote:
> > We were using the aramorph library for some time and so we mapped out
> > the set of features it provides, they come as follows:
> >
> > ----------------------------------------------------------------------------------------------------------------------------------------------------------------
> > The ء and ~ are considered unique characters.
> >
> >
 <>              * أ , آ, ا, and إ are distinct
> >
 <>              * و and ؤ are distinct
> >
 <>              * ى and ئ are distinct
> >
> >
> >
> >      * The ا and ة (denoting the feminine adjective) at the end of a
> >        word are optional.
> >
> >      * The ال, ب, ل, ك, بال, كال, لل at the beginning of a word are
> >        optional
> >
> >      * All حركات as well as the ّ (شدّة) are ignored.
> >
> >      * The ي , و , ات , ون denoting the plural form of a word are
> >        optional. If the indexed word ends with a ة its plural, which
> >        replaces the ة with ات , is recognized.
> >
> >
> >
> >
> >
> > The following examples illustrate these rules:
> >
> >
> > Indexed Word
> >
> >
> > Search Term
> >
> >
> > Success
> >
> >
 <>                  الحياة
> >
> >
 <>                  للحياة
> >
> >
> > True
> >
> >
> >
> >
 <> حياة
> >
> >
> > True
> >
> >
> >
> >
 <> حيا
> >
> >
> > False
> >
> >
> >
> >
 <> ألحياة
> >
> >
> > False
> >
> >
> >
> >
 <> إلحياة
> >
> >
> > False
> >
> >
> >
> >
 <> كالحياة
> >
> >
> > True
> >
> >
> >
> >
 <> بالحياة
> >
> >
> > True
> >
> >
> >
> >
 <> بحياة
> >
> >
> > True
> >
> >
> >
> >
 <> لحياة
> >
> >
> > True
> >
> >
 <> دولارا
> >
> >
 <> دولار
> >
> >
> > True
> >
> >
> >
> >
 <> بدولار
> >
> >
> > True
> >
> >
> >
> >
 <> بالدولار
> >
> >
> > True
> >
> >
> >
> >
 <> الدولار
> >
> >
> > True
> >
> >
> >
> >
 <> دؤلارا
> >
> >
> > False
> >
> >
> >
> >
 <> دولأرا
> >
> >
> > False
> >
> >
> >
> >
 <> دولارأ
> >
> >
> > False
> >
> >
 <> الكاتب
> >
> >
 <> كاتب
> >
> >
> > True
> >
> >
> >
> >
 <> لكاتب
> >
> >
> > True
> >
> >
> >
> >
 <> كاتبة
> >
> >
> > True
> >
> >
> >
> >
 <> الكاتبة
> >
> >
> > True
> >
> >
> >
> >
 <> الكاتبات
> >
> >
> > True
> >
> >
> >
> >
 <> كاتبون
> >
> >
> > True
> >
> >
> >
> >
 <> كاتبو \ كاتبي
> >
> >
> > True
> >
> >
> >
> >
 <> كتب
> >
> >
> > False
> >
> >
 <> جميلة
> >
> >
 <> جميلات
> >
> >
> > True
> >
> >
> >
> >
 <> جميل
> >
> >
> > True
> >
> >
> >
> >
 <> الجمال
> >
> >
> > False
> >
> >
 <> بنت
> >
> >
 <> ابنة
> >
> >
> > False
> >
> >
> >
> >
 <> بن
> >
> >
> > True
> >
> >
> >
> >
 <> ابن
> >
> >
> > True
> >
> >
> >
> >
 <> ابنت
> >
> >
> > False
> >
> >
> >
> > ----------------------------------------------------------------------------------------------------------------------------------------------------------------
> >
> > while with the new one, we only got matches for:
> > |
 <> فّ فُ فٌ فف فِِ فٍ ف
> >  and the likes of that.
> >
> > -walid
> >
> > On Thu, 2009-07-23 at 09:33 -0400, Robert Muir wrote:
> >
> >> walid, can you provide any more information other than "very poor result"?
> >>
> >> Others have not measured much difference between morphological
> >> analysis and light stemming:
> >> http://ciir.cs.umass.edu/pubfiles/ir-249.pdf
> >>
> >>
> >> On Thu, Jul 23, 2009 at 7:34 AM, walid<walid.bakkar@elementn.com> wrote:
> >> > http://issues.apache.org/jira/browse/LUCENE-1406
> >> > http://issues.apache.org/jira/browse/LUCENE-153
> >> >
> >> > based on this, there are two options:
> >> > 1- using the aramorph library
> >> > 2- moving the code from trunk to the current release and using the
> >> > provided arabic analyzer
> >> >
> >> > 1- the library works very well in indexing, tokenizing, stemming and
> >> > everything, but causes memory leaks
> >> > 2- the provided library has a very poor result compared to the aramorph
> >> > library.
> >> >
> >> > Is there a plan to have better arabic support with full morphological
> >> > analysis support?
> >> >
> >> > walid
> >> >
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >> >
> >> >
> >>
> >>
> >>
> >
> 
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message