Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
Message-ID: <48E23497.7090000@gmail.com>
Date: Tue, 30 Sep 2008 10:15:51 -0400
From: DM Smith <dmsmith555@gmail.com>
User-Agent: Thunderbird 2.0.0.16 (X11/20080723)
MIME-Version: 1.0
To: java-dev@lucene.apache.org
Subject: Re: [jira] Commented: (LUCENE-1406) new Arabic Analyzer (Apache
 license)
References: <1370069836.1222430264412.JavaMail.jira@brutus>
	 <1847534690.1222774604541.JavaMail.jira@brutus>
	 <8f0ad1f30809300519u3d02c7a9mc807751dba3325c2@mail.gmail.com>
	 <EC7B22FC-002E-4E74-BAAD-46ED4C1C1E7C@gmail.com>
 <8f0ad1f30809300624nea7c0e3hb0111c0338ca1018@mail.gmail.com>
In-Reply-To: <8f0ad1f30809300624nea7c0e3hb0111c0338ca1018@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit

Robert Muir wrote:
> Can you provide any more information on your use case? I had 
> originally imagined MH, ktiv male spelling only, but your use case is 
> interesting.
>
> Are you currently indexing Biblical Hebrew text? Dotted or undotted?
Biblical Hebrew. A variety of texts: some unpointed, others with points 
and cantillation. All are NFC.

IMHO, it is important to document which normalization form an analyzer 
expects (NFC, NFD, or something else) and to leave it to the calling 
program to normalize its text to that form.
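
To make that concrete, here is a minimal sketch of the caller-side step, 
assuming an analyzer whose documentation says it expects NFC. It uses only 
java.text.Normalizer (Java 6+). The class and method names are made up for 
illustration and are not part of Lucene.

```java
import java.text.Normalizer;

// Hypothetical helper, not a Lucene class: the application normalizes its
// text to whatever form the analyzer documents before indexing or querying.
class NormalizeBeforeAnalysis {

    // Returns the text in the requested normalization form, skipping the
    // conversion when the text is already in that form.
    static String toForm(String text, Normalizer.Form form) {
        return Normalizer.isNormalized(text, form)
                ? text
                : Normalizer.normalize(text, form);
    }

    public static void main(String[] args) {
        // Bet followed by dagesh (U+05BC) and then sheva (U+05B0). The marks
        // are not in canonical order, so this string is not normalized.
        String pointed = "\u05D1\u05BC\u05B0";

        String nfc = toForm(pointed, Normalizer.Form.NFC);

        // Canonical ordering puts sheva before dagesh.
        System.out.println(nfc.equals("\u05D1\u05B0\u05BC"));

        // The normalized string is what would then be handed to the analyzer,
        // both at index time and at query time.
    }
}
```

The same call has to be applied to queries as well as to indexed text, 
otherwise the two sides can disagree on mark order and fail to match.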

>
>
> On Tue, Sep 30, 2008 at 8:54 AM, DM Smith <dmsmith555@gmail.com> wrote:
>
>
>     On Sep 30, 2008, at 8:19 AM, Robert Muir wrote:
>
>>     Cool. Is there interest in similar basic functionality for Hebrew?
>
>     I'm interested as I use Lucene for biblical research.
>
>>
>>
>>     The same rules apply: without using GPL data (i.e. Hspell data) you
>>     can't do it right, but you can do a lot of the common stuff just
>>     like Arabic. Tokenization is a tad more complex, and the
>>     out-of-the-box Western behavior is probably annoying at the least
>>     (splitting words on punctuation where it shouldn't, etc.).
>>
>>     Robert
>>
>>     On Tue, Sep 30, 2008 at 7:36 AM, Grant Ingersoll (JIRA)
>>     <jira@apache.org> wrote:
>>
>>
>>            [
>>         https://issues.apache.org/jira/browse/LUCENE-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635723#action_12635723
>>         ]
>>
>>         Grant Ingersoll commented on LUCENE-1406:
>>         -----------------------------------------
>>
>>         I'll commit once 2.4 is released.
>>
>>         > new Arabic Analyzer (Apache license)
>>         > ------------------------------------
>>         >
>>         >                 Key: LUCENE-1406
>>         >                 URL:
>>         https://issues.apache.org/jira/browse/LUCENE-1406
>>         >             Project: Lucene - Java
>>         >          Issue Type: New Feature
>>         >          Components: Analysis
>>         >            Reporter: Robert Muir
>>         >            Assignee: Grant Ingersoll
>>         >            Priority: Minor
>>         >         Attachments: LUCENE-1406.patch
>>         >
>>         >
>>         > I've noticed there is no Arabic analyzer for Lucene, most
>>         likely because Tim Buckwalter's morphological dictionary is GPL.
>>         > However, it is not necessary to have a full morphological
>>         analysis engine for quality Arabic search.
>>         > This implementation implements the light-8s algorithm
>>         present in the following paper:
>>         http://ciir.cs.umass.edu/pubfiles/ir-249.pdf
>>         > As you can see from the paper, improvement via this method
>>         over searching surface forms (as Lucene currently does) is
>>         significant, with almost 100% improvement in average precision.
>>         > While I personally don't think all the choices were the
>>         best, and some easy improvements are still possible, the
>>         major motivation for implementing it exactly the way it is
>>         presented in the paper is that the algorithm is TREC-tested,
>>         so the precision/recall improvements to Lucene are already
>>         documented.
>>         > For a stopword list, I used a list present at
>>         http://members.unine.ch/jacques.savoy/clef/index.html simply
>>         because the creator of this list documents the data as
>>         BSD-licensed.
>>         > This implementation (Analyzer) consists of the
>>         above-mentioned stopword list plus two filters:
>>         >  ArabicNormalizationFilter: performs orthographic
>>         normalization (such as hamza seated on alif, alif maksura,
>>         teh marbuta, removal of harakat, tatweel, etc.)
>>         >  ArabicStemFilter: performs Arabic light stemming
>>         > Both filters operate directly on the term buffer for maximum
>>         performance. There is no object creation in this Analyzer.
>>         > There are no external dependencies. I've indexed about half
>>         a billion words of Arabic text and tested against that.
>>         > If there are any issues with this implementation I am
>>         willing to fix them. I use Lucene on a daily basis and would
>>         like to give something back. Thanks.
>>
>>         --
>>         This message is automatically generated by JIRA.
>>         -
>>         You can reply to this email to add a comment to the issue online.
>>
>>
>>         ---------------------------------------------------------------------
>>         To unsubscribe, e-mail:
>>         java-dev-unsubscribe@lucene.apache.org
>>         For additional commands, e-mail:
>>         java-dev-help@lucene.apache.org
>>
>>
>>
>>
>>     -- 
>>     Robert Muir
>>     rcmuir@gmail.com
>
>
>
>
> -- 
> Robert Muir
> rcmuir@gmail.com

