From: Robert Muir
To: java-user@lucene.apache.org
Date: Thu, 6 Aug 2009 16:10:33 -0400
Subject: Re: Language Detection for Analysis?
Message-ID: <8f0ad1f30908061310j298fa42ai8f8b773a78be37c3@mail.gmail.com>
In-Reply-To: <786fde50908061305m64e4ffa9kfcdcc170b7229194@mail.gmail.com>

Shai,

I mean doing language-agnostic things that apply to all of these
languages, since they are based on the same writing system: normalizing
all yeh characters (Arabic yeh, Farsi yeh, alef maksura) to the same
form, removing harakat, the kinds of things done in
ArabicNormalizationFilter and PersianNormalizationFilter. A parallel to
this is applying "lowercase" to English, French, Dutch, etc. It's a
good idea.

At least in the Arabic case, you can see here the precision/recall
tradeoffs of doing just normalization, as I mentioned, versus stemming:
http://www.mtholyoke.edu/~lballest/Pubs/arab_stem05.pdf

The benefit you see from stemming assumes you could detect the language
100% accurately, since applying Arabic stemming as is will be terrible
on average for Persian.

So I would definitely start with ArabicTokenizer +
ArabicNormalizationFilter + PersianNormalizationFilter. I think you
could also adjust the source code; for example, I would probably do
very light stemming, at least keeping the leading و prefix for all
these languages.
Selectively applying some of the Persian "stopwords", such as the ها
plural, would probably be OK across all of these as well.

So I really have to wonder whether the more complex approach would, at
the end of the day, give you better results on average than doing
normalization and maybe very light stemming/stopwords...

Hope this helps,
Robert

On Thu, Aug 6, 2009 at 4:05 PM, Shai Erera wrote:
> Robert - can you elaborate on what you mean by "just treat it at the script
> level"?
>
> On Thu, Aug 6, 2009 at 10:55 PM, Robert Muir wrote:
>
>> Bradford, there is an arabic analyzer in trunk. for farsi there is
>> currently a patch available:
>> http://issues.apache.org/jira/browse/LUCENE-1628
>>
>> one option is not to detect languages at all.
>> it could be hard for short queries due to the languages you mentioned
>> borrowing from each other.
>> but you do not want to apply things like stemming to the wrong language.
>>
>> instead, you could use ArabicTokenizer + ArabicNormalizationFilter +
>> PersianNormalizationFilter and just treat it at the script level.
>>
>> On Thu, Aug 6, 2009 at 3:46 PM, Bradford
>> Stephens wrote:
>> > Hey there,
>> >
>> > We're trying to add foreign language support into our new search
>> > engine -- languages like Arabic, Farsi, and Urdu (that don't work with
>> > standard analyzers). But our data source doesn't tell us which
>> > languages we're actually collecting -- we just get blocks of text. Has
>> > anyone here worked on language detection so we can figure out what
>> > analyzers to use? Are there commercial solutions?
>> >
>> > Much appreciated!
>> >
>> > --
>> > http://www.roadtofailure.com -- The Fringes of Scalability, Social
>> > Media, and Computer Science
>> >
>>
>>
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>

-- 
Robert Muir
rcmuir@gmail.com
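
The script-level normalization described in this thread (folding the yeh
variants to one form and removing harakat) can be sketched in plain Java
as below. This is only an illustration under stated assumptions: it is
not the code of Lucene's ArabicNormalizationFilter or
PersianNormalizationFilter, the class name ScriptNormalizer is
hypothetical, and it covers only the characters named in the message
(Arabic yeh, Farsi yeh, alef maksura, and the harakat U+064B..U+0652).

```java
// Sketch only: a script-level normalizer in the spirit of the thread.
// NOT the implementation of Lucene's ArabicNormalizationFilter or
// PersianNormalizationFilter; the handled character set is an assumption.
final class ScriptNormalizer {

    private ScriptNormalizer() {}

    /** Folds yeh variants to Arabic yeh and drops harakat (U+064B..U+0652). */
    static String normalize(String token) {
        StringBuilder out = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c >= '\u064B' && c <= '\u0652') {
                continue; // harakat (fathatan through sukun): remove
            }
            switch (c) {
                case '\u06CC': // Farsi yeh
                case '\u0649': // alef maksura
                    out.append('\u064A'); // fold to Arabic yeh
                    break;
                default:
                    out.append(c); // everything else passes through unchanged
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The same word spelled with two different yeh code points.
        String farsiSpelling = "\u0641\u0627\u0631\u0633\u06CC";  // ends in Farsi yeh
        String arabicSpelling = "\u0641\u0627\u0631\u0633\u064A"; // ends in Arabic yeh
        // After normalization both spellings compare equal.
        System.out.println(normalize(farsiSpelling).equals(normalize(arabicSpelling)));
    }
}
```

Because the mapping depends only on code points, not on which language
the text is in, it can be applied to Arabic, Farsi, and Urdu input alike
without any language detection step, much like lowercasing Latin-script
text.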