Date: Thu, 6 Aug 2009 23:05:52 +0300
Message-ID: <786fde50908061305m64e4ffa9kfcdcc170b7229194@mail.gmail.com>
In-Reply-To: <8f0ad1f30908061255m70212637s1d788dd49f20a0aa@mail.gmail.com>
Subject: Re: Language Detection for Analysis?
From: Shai Erera
To: java-user@lucene.apache.org

Robert - can you elaborate on what you mean by "just treat it at the script level"?

On Thu, Aug 6, 2009 at 10:55 PM, Robert Muir wrote:

> Bradford, there is an arabic analyzer in trunk. for farsi there is
> currently a patch available:
> http://issues.apache.org/jira/browse/LUCENE-1628
>
> one option is not to detect languages at all.
> it could be hard for short queries due to the languages you mentioned
> borrowing from each other.
> but you do not want to apply things like stemming to the wrong language.
>
> instead, you could use ArabicTokenizer + ArabicNormalizationFilter +
> PersianNormalizationFilter and just treat it at the script level.
>
> On Thu, Aug 6, 2009 at 3:46 PM, Bradford Stephens wrote:
> > Hey there,
> >
> > We're trying to add foreign language support into our new search
> > engine -- languages like Arabic, Farsi, and Urdu (that don't work with
> > standard analyzers). But our data source doesn't tell us which
> > languages we're actually collecting -- we just get blocks of text.
> > Has anyone here worked on language detection so we can figure out what
> > analyzers to use? Are there commercial solutions?
> >
> > Much appreciated!
> >
> > --
> > http://www.roadtofailure.com -- The Fringes of Scalability, Social
> > Media, and Computer Science
>
> --
> Robert Muir
> rcmuir@gmail.com