Message-ID: <48E23497.7090000@gmail.com>
Date: Tue, 30 Sep 2008 10:15:51 -0400
From: DM Smith
To: java-dev@lucene.apache.org
Subject: Re: [jira] Commented: (LUCENE-1406) new Arabic Analyzer (Apache license)
References: <1370069836.1222430264412.JavaMail.jira@brutus>
 <1847534690.1222774604541.JavaMail.jira@brutus>
 <8f0ad1f30809300519u3d02c7a9mc807751dba3325c2@mail.gmail.com>
 <8f0ad1f30809300624nea7c0e3hb0111c0338ca1018@mail.gmail.com>
In-Reply-To: <8f0ad1f30809300624nea7c0e3hb0111c0338ca1018@mail.gmail.com>

Robert Muir wrote:
> can you provide any more information on your use case? I had
> originally imagined MH, ktiv male spelling only, but your use case is
> interesting.
>
> Are you currently indexing biblical hebrew text? dotted or undotted?
Biblical Hebrew. Variety of texts. Some unpointed. Others w/ points and
cantillation. All are NFC.

IMHO, I think it is important to document whether an analyzer works with
NFC, NFD or whatever. And leave it to the program to normalize to that form.

>
>
> On Tue, Sep 30, 2008 at 8:54 AM, DM Smith wrote:
>
>
> On Sep 30, 2008, at 8:19 AM, Robert Muir wrote:
>
>> cool. is there interest in similar basic functionality for Hebrew?
>
> I'm interested as I use lucene for biblical research.
>
>>
>>
>> same rules apply: without using GPL data (i.e. Hspell data) you
>> can't do it right, but you can do a lot of the common stuff just
>> like Arabic. Tokenization is a tad bit more complex, and out of
>> box western behavior is probably annoying at the least (splitting
>> words on punctuation where it shouldn't, etc).
>>
>> Robert
>>
>> On Tue, Sep 30, 2008 at 7:36 AM, Grant Ingersoll (JIRA) wrote:
>>
>>
>> [ https://issues.apache.org/jira/browse/LUCENE-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635723#action_12635723 ]
>>
>> Grant Ingersoll commented on LUCENE-1406:
>> -----------------------------------------
>>
>> I'll commit once 2.4 is released.
>>
>> > new Arabic Analyzer (Apache license)
>> > ------------------------------------
>> >
>> > Key: LUCENE-1406
>> > URL: https://issues.apache.org/jira/browse/LUCENE-1406
>> > Project: Lucene - Java
>> > Issue Type: New Feature
>> > Components: Analysis
>> > Reporter: Robert Muir
>> > Assignee: Grant Ingersoll
>> > Priority: Minor
>> > Attachments: LUCENE-1406.patch
>> >
>> >
>> > I've noticed there is no Arabic analyzer for Lucene, most likely
>> > because Tim Buckwalter's morphological dictionary is GPL.
>> > However, it is not necessary to have a full morphological analysis
>> > engine for a quality arabic search.
>> > This implementation implements the light-8s algorithm present in
>> > the following paper: http://ciir.cs.umass.edu/pubfiles/ir-249.pdf
>> > As you can see from the paper, improvement via this method over
>> > searching surface forms (as lucene currently does) is significant,
>> > with almost 100% improvement in average precision.
>> > While I personally don't think all the choices were the best, and
>> > some easy improvements are still possible, the major motivation
>> > for implementing it exactly the way it is presented in the paper
>> > is that the algorithm is TREC-tested, so the precision/recall
>> > improvements to lucene are already documented.
>> > For a stopword list, I used a list present at
>> > http://members.unine.ch/jacques.savoy/clef/index.html simply
>> > because the creator of this list documents the data as
>> > BSD-licensed.
>> > This implementation (Analyzer) consists of above mentioned
>> > stopword list plus two filters:
>> > ArabicNormalizationFilter: performs orthographic normalization
>> > (such as hamza seated on alif, alif maksura, teh marbuta, removal
>> > of harakat, tatweel, etc)
>> > ArabicStemFilter: performs arabic light stemming
>> > Both filters operate directly on termbuffer for maximum
>> > performance. There is no object creation in this Analyzer.
>> > There are no external dependencies. I've indexed about half a
>> > billion words of arabic text and tested against that.
>> > If there are any issues with this implementation I am willing to
>> > fix them. I use lucene on a daily basis and would like to give
>> > something back. Thanks.
>>
>> --
>> This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com
>
>
> --
> Robert Muir
> rcmuir@gmail.com
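
A minimal sketch of the normalization point made in the reply at the top of this message: the application, not the analyzer, brings text into the documented form (NFC here) before indexing or querying. It assumes only the JDK's java.text.Normalizer (Java 6 and later). The class name NfcSketch and the helper toNfc are hypothetical names for illustration, and no Lucene API is involved.

```java
import java.text.Normalizer;

public class NfcSketch {

    /** Returns the text in NFC; hypothetical helper, not part of Lucene. */
    static String toNfc(String text) {
        // Skip the copy when the input is already in the target form.
        if (Normalizer.isNormalized(text, Normalizer.Form.NFC)) {
            return text;
        }
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // The same pointed letter with its two marks entered in different orders.
        String dageshFirst = "\u05D1\u05BC\u05B8"; // bet, dagesh, qamats
        String qamatsFirst = "\u05D1\u05B8\u05BC"; // bet, qamats, dagesh

        // As raw strings the two differ, so they would index as different terms.
        System.out.println("raw equal: " + dageshFirst.equals(qamatsFirst));

        // After normalization the combining marks are in canonical order.
        System.out.println("NFC equal: " + toNfc(dageshFirst).equals(toNfc(qamatsFirst)));
    }
}
```

Applying the same helper to both the indexed text and the query text keeps the two consistent, whichever form the analyzer is documented to expect.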