lucene-dev mailing list archives

From DM Smith <dmsmith...@gmail.com>
Subject Re: [jira] Commented: (LUCENE-1406) new Arabic Analyzer (Apache license)
Date Tue, 30 Sep 2008 12:54:17 GMT

On Sep 30, 2008, at 8:19 AM, Robert Muir wrote:

> cool. is there interest in similar basic functionality for Hebrew?

I'm interested, as I use Lucene for biblical research.

>
>
> same rules apply: without using GPL data (i.e. Hspell data) you can't do it right, but you can do a lot of the common stuff just like Arabic. Tokenization is a tad more complex, and out-of-box western behavior is probably annoying at the least (splitting words on punctuation where it shouldn't, etc.).
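To make the punctuation point concrete: Hebrew abbreviations and acronyms carry a geresh (U+05F3) or gershayim (U+05F4) inside the word, so a naive "split on non-letters" rule tears them apart. A minimal sketch of a tokenizer that keeps such marks when they sit between letters (class and method names are invented for illustration; this is not Lucene code):

```java
import java.util.ArrayList;
import java.util.List;

// Toy tokenizer: keep geresh (U+05F3) and gershayim (U+05F4) -- and their
// ASCII stand-ins ' and " -- when they sit between letters, so acronyms
// like the word for the Israeli army (tsadi-he-gershayim-lamed) survive as
// one token. Names here are invented for this sketch, not Lucene API.
public class HebrewTokenSketch {

    private static boolean partOfWord(char c, char prev, char next) {
        if (Character.isLetter(c)) {
            return true;
        }
        boolean innerMark = c == '\u05F3' || c == '\u05F4' || c == '\'' || c == '"';
        // keep the mark only when flanked by letters on both sides
        return innerMark && Character.isLetter(prev) && Character.isLetter(next);
    }

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char prev = i > 0 ? text.charAt(i - 1) : ' ';
            char next = i + 1 < text.length() ? text.charAt(i + 1) : ' ';
            if (partOfWord(text.charAt(i), prev, next)) {
                current.append(text.charAt(i));
            } else if (current.length() > 0) {
                tokens.add(current.toString());   // flush token at a real break
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

A western tokenizer would emit three fragments for a gershayim acronym; this predicate keeps it whole while still splitting on commas, spaces, and trailing punctuation.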
>
> Robert
>
> On Tue, Sep 30, 2008 at 7:36 AM, Grant Ingersoll (JIRA) <jira@apache.org> wrote:
>
> [ https://issues.apache.org/jira/browse/LUCENE-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635723#action_12635723 ]
>
> Grant Ingersoll commented on LUCENE-1406:
> -----------------------------------------
>
> I'll commit once 2.4 is released.
>
> > new Arabic Analyzer (Apache license)
> > ------------------------------------
> >
> >                 Key: LUCENE-1406
> >                 URL: https://issues.apache.org/jira/browse/LUCENE-1406
> >             Project: Lucene - Java
> >          Issue Type: New Feature
> >          Components: Analysis
> >            Reporter: Robert Muir
> >            Assignee: Grant Ingersoll
> >            Priority: Minor
> >         Attachments: LUCENE-1406.patch
> >
> >
> > I've noticed there is no Arabic analyzer for Lucene, most likely because Tim Buckwalter's morphological dictionary is GPL.
> > However, a full morphological analysis engine is not necessary for quality Arabic search.
> > This implementation implements the light-8s algorithm presented in the following paper: http://ciir.cs.umass.edu/pubfiles/ir-249.pdf
> > As you can see from the paper, improvement via this method over searching surface forms (as Lucene currently does) is significant, with almost 100% improvement in average precision.
> > While I personally don't think all the choices were the best, and some easy improvements are still possible, the major motivation for implementing it exactly as presented in the paper is that the algorithm is TREC-tested, so the precision/recall improvements to Lucene are already documented.
> > For a stopword list, I used a list present at http://members.unine.ch/jacques.savoy/clef/index.html simply because the creator of this list documents the data as BSD-licensed.
> > This implementation (Analyzer) consists of the above-mentioned stopword list plus two filters:
> >  ArabicNormalizationFilter: performs orthographic normalization (such as hamza seated on alif, alif maksura, teh marbuta, removal of harakat, tatweel, etc.)
> >  ArabicStemFilter: performs Arabic light stemming
> > Both filters operate directly on the term buffer for maximum performance. There is no object creation in this Analyzer.
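As a rough illustration of the orthographic normalization step described above, here is a minimal in-place sketch in the same "work on the char buffer, no object creation" style. The mappings follow the list in the description (hamza-seated alef forms to bare alef, alef maksura to yeh, teh marbuta to heh, drop harakat and tatweel); the exact rule set of the committed ArabicNormalizationFilter may differ.

```java
// In-place sketch of Arabic orthographic normalization. The mappings mirror
// the list in the issue description; this is an illustration, not the
// committed Lucene filter.
public class ArabicNormalizeSketch {

    private static final char ALEF             = '\u0627';
    private static final char ALEF_MADDA       = '\u0622';
    private static final char ALEF_HAMZA_ABOVE = '\u0623';
    private static final char ALEF_HAMZA_BELOW = '\u0625';
    private static final char YEH              = '\u064A';
    private static final char ALEF_MAKSURA     = '\u0649';
    private static final char HEH              = '\u0647';
    private static final char TEH_MARBUTA      = '\u0629';
    private static final char TATWEEL          = '\u0640';

    /** Normalizes buffer[0..len) in place and returns the new length. */
    public static int normalize(char[] buffer, int len) {
        int out = 0;
        for (int i = 0; i < len; i++) {
            char c = buffer[i];
            // drop tatweel and the harakat (diacritics U+064B..U+0652)
            if (c == TATWEEL || (c >= '\u064B' && c <= '\u0652')) {
                continue;
            }
            if (c == ALEF_MADDA || c == ALEF_HAMZA_ABOVE || c == ALEF_HAMZA_BELOW) {
                c = ALEF;          // fold hamza-seated alef forms to bare alef
            } else if (c == ALEF_MAKSURA) {
                c = YEH;           // alef maksura -> yeh
            } else if (c == TEH_MARBUTA) {
                c = HEH;           // teh marbuta -> heh
            }
            buffer[out++] = c;
        }
        return out;
    }
}
```

Because removals only ever shrink the term, rewriting the same buffer with a read index and a write index avoids any allocation per token.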
> > There are no external dependencies. I've indexed about half a billion words of Arabic text and tested against that.
> > If there are any issues with this implementation I am willing to fix them. I use Lucene on a daily basis and would like to give something back. Thanks.
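For readers unfamiliar with light stemming, the idea is simply affix stripping: remove a definite-article prefix and a common suffix, with no morphological analysis. A toy sketch follows; the affix sets below are a small illustrative subset chosen for this example, not the exact light-8s lists from the paper or the attached patch, and input is assumed to be already normalized.

```java
// Toy illustration of Arabic light stemming: strip at most one
// definite-article prefix and one common suffix, always leaving at least
// two characters. Affix lists are an illustrative subset only.
public class ArabicLightStemSketch {

    private static final String[] PREFIXES = {
        "\u0648\u0627\u0644",   // waw + alef + lam ("and the")
        "\u0627\u0644"          // alef + lam (the definite article)
    };
    private static final String[] SUFFIXES = {
        "\u0647\u0627",         // heh + alef
        "\u0627\u062A",         // alef + teh (feminine plural)
        "\u0648\u0646",         // waw + noon
        "\u064A\u0646",         // yeh + noon
        "\u0629",               // teh marbuta
        "\u0647"                // heh
    };

    public static String stem(String word) {
        for (String p : PREFIXES) {            // longest prefix first
            if (word.startsWith(p) && word.length() >= p.length() + 2) {
                word = word.substring(p.length());
                break;                          // strip at most one prefix
            }
        }
        for (String s : SUFFIXES) {            // longest suffix first
            if (word.endsWith(s) && word.length() >= s.length() + 2) {
                word = word.substring(0, word.length() - s.length());
                break;                          // strip at most one suffix
            }
        }
        return word;
    }
}
```

The length guards (`>= affix length + 2`) are what keep a light stemmer from destroying short words, which is a large part of why the approach holds up without a dictionary.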
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>
>
>
> -- 
> Robert Muir
> rcmuir@gmail.com

