lucene-dev mailing list archives

From "Robert Muir" <rcm...@gmail.com>
Subject Re: [jira] Commented: (LUCENE-1406) new Arabic Analyzer (Apache license)
Date Tue, 30 Sep 2008 19:58:30 GMT
Thanks for the clarification. With this method the Arabic analyzer could
lemmatize rather than stem, using the Buckwalter dictionary, and things like
broken plurals would work correctly.

I'm not sure yet whether Hspell has this type of information, but it would at
least give better stems for Hebrew as well.
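
A minimal sketch of the idea, for illustration only: look each surface form up
in an optional, user-supplied dictionary and fall back to a stemmer when the
dictionary is absent or has no entry. The class name OptionalLemmaDictionary,
the tab-separated "surface<TAB>lemma" file layout, and the fallback function
are all assumptions made for this example; this is not the Buckwalter or
Hspell data format and not code from the patch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical helper: prefers a dictionary lemma, otherwise applies a fallback stemmer.
final class OptionalLemmaDictionary {
    private final Map<String, String> lemmas;
    private final UnaryOperator<String> fallback;

    OptionalLemmaDictionary(Map<String, String> lemmas, UnaryOperator<String> fallback) {
        this.lemmas = lemmas;
        this.fallback = fallback;
    }

    // Loads "surface<TAB>lemma" lines if the file is present and readable.
    // A missing file is not an error: the result simply uses the fallback for every term.
    static OptionalLemmaDictionary load(Path file, UnaryOperator<String> fallback) {
        Map<String, String> entries = new HashMap<>();
        if (file != null && Files.isReadable(file)) {
            try {
                for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                    int tab = line.indexOf('\t');
                    if (tab > 0) {
                        entries.put(line.substring(0, tab), line.substring(tab + 1));
                    }
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return new OptionalLemmaDictionary(entries, fallback);
    }

    // Dictionary hit wins (this is what lets an irregular form such as a broken
    // plural map to its singular); otherwise the fallback stemmer is applied.
    String reduce(String surface) {
        String lemma = lemmas.get(surface);
        return lemma != null ? lemma : fallback.apply(surface);
    }
}
```

With this shape the separately licensed data never has to ship with the code:
the analyzer works without it and improves when a user supplies the file.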


On Tue, Sep 30, 2008 at 11:09 AM, Otis Gospodnetic <
otis_gospodnetic@yahoo.com> wrote:

> Oh, and a note on GPL.  It's fine to make use of GPL data, it's just that
> ASF cannot distribute it.  So the code could come with (java)docs that point
> out that things would be better if the analyzer could use the GPL data that
> can be downloaded from X and it could be written to make use of the GPL data
> if it's available/given to the analyzer.
>
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> From: Robert Muir <rcmuir@gmail.com>
> To: java-dev@lucene.apache.org
> Sent: Tuesday, September 30, 2008 8:19:35 AM
> Subject: Re: [jira] Commented: (LUCENE-1406) new Arabic Analyzer (Apache
> license)
>
> cool. is there interest in similar basic functionality for Hebrew?
>
> same rules apply: without using GPL data (i.e. Hspell data) you can't do it
> right, but you can do a lot of the common stuff just like Arabic.
> Tokenization is a tad bit more complex, and out of box western behavior is
> probably annoying at the least (splitting words on punctuation where it
> shouldn't, etc).
>
> Robert
>
> On Tue, Sep 30, 2008 at 7:36 AM, Grant Ingersoll (JIRA) <jira@apache.org> wrote:
>
>>
>>    [
>> https://issues.apache.org/jira/browse/LUCENE-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635723#action_12635723]
>>
>> Grant Ingersoll commented on LUCENE-1406:
>> -----------------------------------------
>>
>> I'll commit once 2.4 is released.
>>
>> > new Arabic Analyzer (Apache license)
>> > ------------------------------------
>> >
>> >                 Key: LUCENE-1406
>> >                 URL: https://issues.apache.org/jira/browse/LUCENE-1406
>> >             Project: Lucene - Java
>> >          Issue Type: New Feature
>> >          Components: Analysis
>> >            Reporter: Robert Muir
>> >            Assignee: Grant Ingersoll
>> >            Priority: Minor
>> >         Attachments: LUCENE-1406.patch
>> >
>> >
>> > I've noticed there is no Arabic analyzer for Lucene, most likely because
>> Tim Buckwalter's morphological dictionary is GPL.
>> > However, it is not necessary to have a full morphological analysis engine
>> for quality Arabic search.
>> > This implementation implements the light-8s algorithm present in the
>> following paper: http://ciir.cs.umass.edu/pubfiles/ir-249.pdf
>> > As you can see from the paper, improvement via this method over
>> searching surface forms (as lucene currently does) is significant, with
>> almost 100% improvement in average precision.
>> > While I personally don't think all the choices were the best, and some
>> easy improvements are still possible, the major motivation for
>> implementing it exactly the way it is presented in the paper is that the
>> algorithm is TREC-tested, so the precision/recall improvements to lucene are
>> already documented.
>> > For a stopword list, I used a list present at
>> http://members.unine.ch/jacques.savoy/clef/index.html simply because the
>> creator of this list documents the data as BSD-licensed.
>> > This implementation (Analyzer) consists of above mentioned stopword list
>> plus two filters:
>> >  ArabicNormalizationFilter: performs orthographic normalization (such as
>> hamza seated on alif, alif maksura, teh marbuta, removal of harakat,
>> tatweel, etc)
>> >  ArabicStemFilter: performs arabic light stemming
>> > Both filters operate directly on termbuffer for maximum performance.
>> There is no object creation in this Analyzer.
>> > There are no external dependencies. I've indexed about half a billion
>> words of arabic text and tested against that.
>> > If there are any issues with this implementation I am willing to fix
>> them. I use lucene on a daily basis and would like to give something back.
>> Thanks.
>>
>> --
>> This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.
>>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>



-- 
Robert Muir
rcmuir@gmail.com
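
The orthographic normalization listed in the quoted issue description (hamza
seated on alif, alif maksura, teh marbuta, removal of harakat and tatweel) can
be sketched roughly as below. This is one plausible reading of those steps,
not the code in LUCENE-1406.patch: the class name ArabicNormalizeSketch is
hypothetical, the exact target of each mapping is an assumption, and the
sketch builds a new String for readability, whereas the description says the
real filters work directly on the term buffer without creating objects.

```java
// Hypothetical illustration of Arabic orthographic normalization.
final class ArabicNormalizeSketch {
    private ArabicNormalizeSketch() {}

    static String normalize(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '\u0622': // alef with madda above
                case '\u0623': // alef with hamza above
                case '\u0625': // alef with hamza below
                    out.append('\u0627'); // assumed target: bare alef
                    break;
                case '\u0649': // alef maksura
                    out.append('\u064A'); // assumed target: yeh
                    break;
                case '\u0629': // teh marbuta
                    out.append('\u0647'); // assumed target: heh
                    break;
                case '\u0640': // tatweel (elongation mark): dropped
                    break;
                default:
                    // U+064B..U+0652 covers fathatan through sukun (the harakat,
                    // including shadda); those are dropped, everything else is kept.
                    if (c < '\u064B' || c > '\u0652') {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }
}
```

Applying the same function to both indexed text and query terms is what makes
spelling variants of one word match each other.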
