Date: Fri, 10 Jul 2009 18:13:20 -0400
Message-ID: <8f0ad1f30907101513n40ec9e71v3d25c0fab3e345f6@mail.gmail.com>
Subject: Re: Hindi, diacritics and search results
From: Robert Muir
To: java-user@lucene.apache.org

Which analyzer in particular are you using? It's probably not doing what
you want for Hindi. These "diacritics" are important (vowels, etc.).

On Fri, Jul 10, 2009 at 3:10 PM, OBender wrote:
> Hi All,
>
> I'm using the default setup of Lucene (no custom analyzers configured) and
> came across the following issue:
>
> In Hindi, if there is a letter with a diacritic in a phrase, Lucene will find
> the phrase with this letter even if the search string is for the letter
> without the diacritic.
>
> Is this expected behavior? Maybe this is standard for all languages with
> letters that have diacritics?
>
> From a pure byte standpoint I can see the logic: the letter with the diacritic
> takes 6 bytes (E0 A4 95 E0 A5 87) and the single letter takes 3 (E0 A4 95),
> so if I search for *some_letter* where some_letter has code (E0 A4 95),
> Lucene finds the "phrase" (E0 A4 95 E0 A5 87) that includes that letter.
>
> Any comments much appreciated.
>
> Thanks.

--
Robert Muir
rcmuir@gmail.com
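
For reference, here is a small sketch of the byte sequences quoted above, using only the JDK (no Lucene classes). The class and method names are made up for illustration. `splitOnNonLetters` is a hypothetical stand-in for a tokenizer that keeps only characters for which `Character.isLetter` is true. It is not the behavior of any particular Lucene analyzer, just one plausible way the vowel sign could get dropped before matching.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class DevanagariBytes {

    // Hex dump of the UTF-8 encoding of s, e.g. "E0 A4 95".
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Hypothetical tokenizer (an assumption, not a Lucene class): keeps runs of
    // code points accepted by Character.isLetter and splits on everything else.
    static List<String> splitOnNonLetters(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String ka = "\u0915";        // DEVANAGARI LETTER KA
        String ke = "\u0915\u0947";  // KA followed by DEVANAGARI VOWEL SIGN E

        System.out.println(utf8Hex(ka));
        System.out.println(utf8Hex(ke));

        // The vowel sign is a combining mark, not a letter, per the JDK tables.
        System.out.println(Character.getType(0x0947) == Character.NON_SPACING_MARK);
        System.out.println(Character.isLetter(0x0947));

        // Under the letter-only assumption both strings yield the same tokens.
        System.out.println(splitOnNonLetters(ke).equals(splitOnNonLetters(ka)));
    }
}
```

If the analyzer in use behaves like this sketch, the indexed term for the letter with the vowel sign and the query term for the bare letter end up identical, which would explain the match. Checking the actual tokens the analyzer emits for the Hindi text is the way to confirm it.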