lucene-dev mailing list archives

From "Robert Muir" <rcm...@gmail.com>
Subject Re: [jira] Commented: (LUCENE-1406) new Arabic Analyzer (Apache license)
Date Tue, 30 Sep 2008 14:36:05 GMT
Thanks for your feedback. Below is a description of an idea for a Biblical
Hebrew stemmer that would work somewhat differently from a Modern Hebrew
stemmer.

With regard to pointing, I can imagine a user being frustrated if a word
is stemmed too aggressively when niqqud is present in the query, the text, or
both. It would be nice to use the niqqud, when available, for higher-precision
stemming before it is normalized away.

Stemming has to be consistent for dotted and undotted input, because user
input will often lack niqqud. One trick would be to index multiple tokens for
undotted text (i.e. with and without ha-) as synonyms, but not for dotted text
(since the niqqud leaves less ambiguity). This would ensure that recall does
not suffer, would increase precision, and would not increase index size for
dotted text.
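
A minimal Python sketch of that indexing rule, for illustration only. The
function names and the article test (he + patah followed by a letter that
carries a dagesh) are assumptions, not something worked out above; the test is
deliberately crude and misses, for example, the article before gutturals.

```python
import unicodedata

HE = "\u05D4"      # HEBREW LETTER HE
PATAH = "\u05B7"   # HEBREW POINT PATAH
DAGESH = "\u05BC"  # HEBREW POINT DAGESH OR MAPIQ


def is_mark(ch):
    """True for combining marks (niqqud, cantillation) in the Hebrew block."""
    return "\u0591" <= ch <= "\u05C7" and unicodedata.category(ch) == "Mn"


def strip_marks(token):
    """Normalize the pointing away, leaving only the consonantal text."""
    return "".join(ch for ch in token if not is_mark(ch))


def marks_after(token, i):
    """Return the run of marks following token[i] and the index after that run."""
    j = i + 1
    while j < len(token) and is_mark(token[j]):
        j += 1
    return token[i + 1:j], j


def looks_like_article(token):
    """Crude, assumed test for ha-: he with patah, next letter with dagesh."""
    if not token.startswith(HE):
        return False
    he_marks, nxt = marks_after(token, 0)
    if PATAH not in he_marks or nxt >= len(token):
        return False
    next_marks, _ = marks_after(token, nxt)
    return DAGESH in next_marks


def index_terms(token):
    """Return the terms to index at a single position for one token."""
    bare = strip_marks(token)
    dotted = bare != token
    if not bare.startswith(HE) or len(bare) < 3:
        return [bare]
    if not dotted:
        # Undotted and ambiguous: index both readings as synonyms.
        return [bare, bare[1:]]
    # Dotted: let the niqqud decide, so only one term is indexed.
    return [bare[1:]] if looks_like_article(token) else [bare]
```

Applying the same function to query tokens keeps both sides consistent: an
undotted query expands to both readings, while a dotted query yields a single
term.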

The downside is that index size would increase for your undotted biblical
text. This is why it would have to be a separate analyzer from the Modern
Hebrew (MH) stemmer, because niqqud is rare in MH.

With regard to your comment about Unicode normalization, I am unaware of
any characters in the Hebrew block that are encoded differently in NFC
versus NFD. The only thing this would affect is the Hebrew presentation
forms (in the Alphabetic Presentation Forms block). The analyzer would not
work with presentation-forms text, just as the Arabic analyzer doesn't; you
need Unicode normalization (Java 6 or ICU) to fix such text.
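
A small Python illustration of both points, using the standard unicodedata
module in place of Java 6 or ICU. The function name is hypothetical, and the
sample strings are just convenient examples.

```python
import unicodedata


def to_analyzer_form(text):
    """Normalize to NFC before analysis (hypothetical pre-processing step)."""
    return unicodedata.normalize("NFC", text)


# Hebrew-block text only: bet + dagesh + sheva, with the marks typed out of
# canonical order. NFC and NFD reorder the marks the same way and compose
# nothing, so the two forms come out identical.
pointed = "\u05D1\u05BC\u05B0"
same_in_both_forms = (
    unicodedata.normalize("NFC", pointed) == unicodedata.normalize("NFD", pointed)
)

# Presentation form: U+FB2A (shin with shin dot). Normalization replaces it
# with Hebrew-block characters: shin (U+05E9) followed by shin dot (U+05C1).
presentation = "\uFB2A"
fixed = to_analyzer_form(presentation)
```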

Thanks,
Robert

On Tue, Sep 30, 2008 at 10:15 AM, DM Smith <dmsmith555@gmail.com> wrote:

> Robert Muir wrote:
>
>> can you provide any more information on your use case? I had originally
>> imagined MH, ktiv male spelling only, but your use case is interesting.
>>
>> Are you currently indexing biblical hebrew text? dotted or undotted?
>>
> Biblical Hebrew. Variety of texts. Some unpointed. Others w/ points and
> cantillation. All are NFC.
>
> IMHO, I think it is important to document whether an analyzer works with
> NFC, NFD or whatever. And leave it to the program to normalize to that form.
>
>
>>
>> On Tue, Sep 30, 2008 at 8:54 AM, DM Smith <dmsmith555@gmail.com> wrote:
>>
>>
>>    On Sep 30, 2008, at 8:19 AM, Robert Muir wrote:
>>
>>     cool. is there interest in similar basic functionality for Hebrew?
>>>
>>
>>    I'm interested as I use lucene for biblical research.
>>
>>
>>>
>>>    same rules apply: without using GPL data (i.e. Hspell data) you
>>>    can't do it right, but you can do a lot of the common stuff just
>>>    like Arabic. Tokenization is a tad bit more complex, and out of
>>>    box western behavior is probably annoying at the least (splitting
>>>    words on punctuation where it shouldn't, etc).
>>>
>>>    Robert
>>>
>>>    On Tue, Sep 30, 2008 at 7:36 AM, Grant Ingersoll (JIRA)
>>>    <jira@apache.org> wrote:
>>>
>>>
>>>        [ https://issues.apache.org/jira/browse/LUCENE-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635723#action_12635723 ]
>>>
>>>        Grant Ingersoll commented on LUCENE-1406:
>>>        -----------------------------------------
>>>
>>>        I'll commit once 2.4 is released.
>>>
>>>        > new Arabic Analyzer (Apache license)
>>>        > ------------------------------------
>>>        >
>>>        >                 Key: LUCENE-1406
>>>        >                 URL:
>>>        https://issues.apache.org/jira/browse/LUCENE-1406
>>>        >             Project: Lucene - Java
>>>        >          Issue Type: New Feature
>>>        >          Components: Analysis
>>>        >            Reporter: Robert Muir
>>>        >            Assignee: Grant Ingersoll
>>>        >            Priority: Minor
>>>        >         Attachments: LUCENE-1406.patch
>>>        >
>>>        >
>>>        > I've noticed there is no Arabic analyzer for Lucene, most
>>>        likely because Tim Buckwalter's morphological dictionary is GPL.
>>>        > However, it is not necessary to have a full morphological
>>>        analysis engine for quality Arabic search.
>>>        > This implementation implements the light-8s algorithm
>>>        present in the following paper:
>>>        http://ciir.cs.umass.edu/pubfiles/ir-249.pdf
>>>        > As you can see from the paper, improvement via this method
>>>        over searching surface forms (as lucene currently does) is
>>>        significant, with almost 100% improvement in average precision.
>>>        > While I personally don't think all the choices were the
>>>        best, and some easy improvements are still possible, the
>>>        major motivation for implementing it exactly the way it is
>>>        presented in the paper is that the algorithm is TREC-tested,
>>>        so the precision/recall improvements to lucene are already
>>>        documented.
>>>        > For a stopword list, I used a list present at
>>>        http://members.unine.ch/jacques.savoy/clef/index.html simply
>>>        because the creator of this list documents the data as
>>>        BSD-licensed.
>>>        > This implementation (Analyzer) consists of the above-mentioned
>>>        stopword list plus two filters:
>>>        >  ArabicNormalizationFilter: performs orthographic
>>>        normalization (such as hamza seated on alif, alif maksura,
>>>        teh marbuta, removal of harakat, tatweel, etc)
>>>        >  ArabicStemFilter: performs arabic light stemming
>>>        > Both filters operate directly on termbuffer for maximum
>>>        performance. There is no object creation in this Analyzer.
>>>        > There are no external dependencies. I've indexed about half
>>>        a billion words of arabic text and tested against that.
>>>        > If there are any issues with this implementation I am
>>>        willing to fix them. I use lucene on a daily basis and would
>>>        like to give something back. Thanks.
>>>
>>>        --
>>>        This message is automatically generated by JIRA.
>>>        -
>>>        You can reply to this email to add a comment to the issue online.
>>>
>>>
>>>
>>>  ---------------------------------------------------------------------
>>>        To unsubscribe, e-mail:
>>>        java-dev-unsubscribe@lucene.apache.org
>>>        For additional commands, e-mail:
>>>        java-dev-help@lucene.apache.org
>>>
>>>
>>>
>>>
>>>    --
>>>    Robert Muir
>>>    rcmuir@gmail.com
>>>
>>
>>
>>
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com
>>
>
>
>
>


-- 
Robert Muir
rcmuir@gmail.com
