lucene-dev mailing list archives

From DM Smith <dmsmith...@gmail.com>
Subject Re: [jira] Commented: (LUCENE-1406) new Arabic Analyzer (Apache license)
Date Tue, 30 Sep 2008 14:15:51 GMT
Robert Muir wrote:
> can you provide any more information on your use case? I had 
> originally imagined MH, ktiv male spelling only, but your use case is 
> interesting.
>
> Are you currently indexing biblical hebrew text? dotted or undotted?
Biblical Hebrew. Variety of texts. Some unpointed. Others w/ points and 
cantillation. All are NFC.

IMHO, it is important to document which Unicode normalization form (NFC, 
NFD or whatever) an analyzer expects, and to leave it to the calling 
program to normalize its text to that form.
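
As a rough sketch of what normalizing on the program side could look like, 
assuming an analyzer that documents NFC as its expected form: the class and 
method names (NfcSketch, toNfc) are hypothetical and not part of Lucene; 
java.text.Normalizer is the standard JDK class (Java 6 and later). The 
example uses bet with dagesh and patah entered in two different mark orders.

```java
import java.text.Normalizer;

// Hypothetical application-side helper; not part of Lucene.
final class NfcSketch {

    // Assumption: the analyzer documents NFC as the form it expects.
    static String toNfc(String text) {
        // Skip the copy when the text is already in NFC.
        if (Normalizer.isNormalized(text, Normalizer.Form.NFC)) {
            return text;
        }
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Bet with dagesh and patah, entered in two different mark orders.
        String dageshFirst = "\u05D1\u05BC\u05B7";
        String patahFirst = "\u05D1\u05B7\u05BC";

        // Compare the raw strings: a mismatch here means different index terms.
        System.out.println(dageshFirst.equals(patahFirst));
        // Compare again after both are brought into the same normalization form.
        System.out.println(toNfc(dageshFirst).equals(toNfc(patahFirst)));
    }
}
```

Query text would need to go through the same call before it reaches the 
analyzer, otherwise index and query terms could still disagree.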

>
>
> On Tue, Sep 30, 2008 at 8:54 AM, DM Smith <dmsmith555@gmail.com> wrote:
>
>
>     On Sep 30, 2008, at 8:19 AM, Robert Muir wrote:
>
>>     cool. is there interest in similar basic functionality for Hebrew?
>
>     I'm interested as I use lucene for biblical research.
>
>>
>>
>>     same rules apply: without using GPL data (i.e. Hspell data) you
>>     can't do it right, but you can do a lot of the common stuff just
>>     like Arabic. Tokenization is a tad bit more complex, and out of
>>     box western behavior is probably annoying at the least (splitting
>>     words on punctuation where it shouldn't, etc).
>>
>>     Robert
>>
>>     On Tue, Sep 30, 2008 at 7:36 AM, Grant Ingersoll (JIRA)
>>     <jira@apache.org> wrote:
>>
>>
>>            [
>>         https://issues.apache.org/jira/browse/LUCENE-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635723#action_12635723
>>         ]
>>
>>         Grant Ingersoll commented on LUCENE-1406:
>>         -----------------------------------------
>>
>>         I'll commit once 2.4 is released.
>>
>>         > new Arabic Analyzer (Apache license)
>>         > ------------------------------------
>>         >
>>         >                 Key: LUCENE-1406
>>         >                 URL:
>>         https://issues.apache.org/jira/browse/LUCENE-1406
>>         >             Project: Lucene - Java
>>         >          Issue Type: New Feature
>>         >          Components: Analysis
>>         >            Reporter: Robert Muir
>>         >            Assignee: Grant Ingersoll
>>         >            Priority: Minor
>>         >         Attachments: LUCENE-1406.patch
>>         >
>>         >
>>         > I've noticed there is no Arabic analyzer for Lucene, most
>>         likely because Tim Buckwalter's morphological dictionary is GPL.
>>         > However, it is not necessary to have a full morphological
>>         analysis engine for quality Arabic search.
>>         > This implementation implements the light-8s algorithm
>>         present in the following paper:
>>         http://ciir.cs.umass.edu/pubfiles/ir-249.pdf
>>         > As you can see from the paper, improvement via this method
>>         over searching surface forms (as lucene currently does) is
>>         significant, with almost 100% improvement in average precision.
>>         > While I personally don't think all the choices were the
>>         best, and some easy improvements are still possible, the
>>         major motivation for implementing it exactly the way it is
>>         presented in the paper is that the algorithm is TREC-tested,
>>         so the precision/recall improvements to lucene are already
>>         documented.
>>         > For a stopword list, I used a list present at
>>         http://members.unine.ch/jacques.savoy/clef/index.html simply
>>         because the creator of this list documents the data as
>>         BSD-licensed.
>>         > This implementation (Analyzer) consists of the above-mentioned
>>         stopword list plus two filters:
>>         >  ArabicNormalizationFilter: performs orthographic
>>         normalization (such as hamza seated on alif, alif maksura,
>>         teh marbuta, removal of harakat, tatweel, etc)
>>         >  ArabicStemFilter: performs arabic light stemming
>>         > Both filters operate directly on termbuffer for maximum
>>         performance. There is no object creation in this Analyzer.
>>         > There are no external dependencies. I've indexed about half
>>         a billion words of arabic text and tested against that.
>>         > If there are any issues with this implementation I am
>>         willing to fix them. I use lucene on a daily basis and would
>>         like to give something back. Thanks.
>>
>>         --
>>         This message is automatically generated by JIRA.
>>         -
>>         You can reply to this email to add a comment to the issue online.
>>
>>
>>         ---------------------------------------------------------------------
>>         To unsubscribe, e-mail:
>>         java-dev-unsubscribe@lucene.apache.org
>>         For additional commands, e-mail:
>>         java-dev-help@lucene.apache.org
>>
>>
>>
>>
>>     -- 
>>     Robert Muir
>>     rcmuir@gmail.com
>
>
>
>
> -- 
> Robert Muir
> rcmuir@gmail.com



