lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From walid <walid.bak...@elementn.com>
Subject Re: arabic analyzer
Date Fri, 24 Jul 2009 08:06:53 GMT
We were using the aramorph library for some time and so we mapped out
the set of features it provides, they come as follows:

----------------------------------------------------------------------------------------------------------------------------------------------------------------
The ء and ~ are considered unique characters.

        
              * أ , آ, ا, and إ are distinct
                
              * و and ؤ are distinct
                
              * ى and ئ are distinct



      * The ا and ة (denoting the feminine adjective) at the end of a
        word are optional.
        
      * The ال, ب, ل, ك, بال, كال, لل at the beginning of a word are
        optional
        
      * All حركات as well as the ّ (شدّة) are ignored.
        
      * The ي , و , ات , ون denoting the plural form of a word are
        optional. If the indexed word ends with a ة its plural, which
        replaces the ة with ات , is recognized.
        




The following examples illustrate these rules:


Indexed Word


Search Term


Success


                  الحياة


                  للحياة


True




حياة


True




حيا


False




ألحياة


False




إلحياة


False




كالحياة


True




بالحياة


True




بحياة


True




لحياة


True


دولارا


دولار


True




بدولار


True




بالدولار


True




الدولار


True




دؤلارا


False




دولأرا


False




دولارأ


False


الكاتب


كاتب


True




لكاتب


True




كاتبة


True




الكاتبة


True




الكاتبات


True




كاتبون


True




كاتبو \ كاتبي


True




كتب


False


جميلة


جميلات


True




جميل


True




الجمال


False


بنت


ابنة


False




بن


True




ابن


True




ابنت


False



----------------------------------------------------------------------------------------------------------------------------------------------------------------

while with the new one, we only got matches for:
|
فّ فُ فٌ فف فِِ فٍ ف
 and the likes of that.

-walid

On Thu, 2009-07-23 at 09:33 -0400, Robert Muir wrote:

> walid, can you provide any more information other than "very poor result"?
> 
> Others have not measured much difference between morphological
> analysis and light stemming:
> http://ciir.cs.umass.edu/pubfiles/ir-249.pdf
> 
> 
> On Thu, Jul 23, 2009 at 7:34 AM, walid<walid.bakkar@elementn.com> wrote:
> > http://issues.apache.org/jira/browse/LUCENE-1406
> > http://issues.apache.org/jira/browse/LUCENE-153
> >
> > based on this, there are two options:
> > 1- using the aramorph library
> > 2- moving the code from trunk to the current release and using the
> > provided arabic analyzer
> >
> > 1- the library works very well in indexing, tokenizing, stemming and
> > everything, but causes memory leaks
> > 2- the provided library has a very poor result compared to the aramorph
> > library.
> >
> > Is there a plan to have better arabic support with full morphological
> > analysis support?
> >
> > walid
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> 
> 
> 

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message