From: Ilya Zavorin
To: "java-user@lucene.apache.org"
Subject: RE: how to fully preprocess query before fuzzy search?
Date: Mon, 17 Sep 2012 15:08:20 +0000

Thanks. So I do not need to escape the "&" in

"dog & cat"

but I do need to escape the "&&" in

"dog && cat"

correct? And do I escape it as "dog \&& cat" or as "dog \&\& cat"?

Ilya

-----Original Message-----
From: Jack Krupansky [mailto:jack@basetechnology.com]
Sent: Monday, September 17, 2012 10:55 AM
To: java-user@lucene.apache.org
Subject: Re: how to fully preprocess query before fuzzy search?

"
Lucene supports escaping special characters that are part of the query syntax. The current list special characters are

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /
"

See:
http://lucene.apache.org/core/4_0_0-ALPHA/queryparser/org/apache/lucene/queryparser/classic/package-summary.html

So, maybe you should escape all special characters, and then append the fuzzy operator (~).

Note: In 4.0 the fuzzy query is limited to an edit distance of 2.

-- Jack Krupansky

-----Original Message-----
From: Ilya Zavorin
Sent: Monday, September 17, 2012 10:41 AM
To: java-user@lucene.apache.org
Subject: how to fully preprocess query before fuzzy search?

I am processing a bunch of text coming out of OCR, i.e. it's machine-generated text that contains some errors like garbage characters attached to words, letters replaced with similar-looking characters (e.g. "I" with "1") etc. The text is whitespace-tokenized and I am trying to match each token against an index using a fuzzy match, so that small amounts of occasional garbage in the tokens do not prevent a match.

Right now I am preprocessing each query as follows:

//term = token
Query queryF = parser.Parse(term.Replace("~", "") + "~");

However, searcher.Search still throws "can't parse" exceptions for queries that contain brackets, quotes and other garbage characters. So how should I fully preprocess a query to avoid these exceptions?

Looks like I just need to remove a certain set of characters, just like the tilde is removed above. What is the complete set of such characters? Do I need to do any other preprocessing?

Thanks,

Ilya Zavorin

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
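
The suggestion in the thread (escape every special character, then append the fuzzy operator) can be sketched roughly as below. This is only an illustration under some assumptions: the class FuzzyTermEscaper and its methods are hypothetical names, not part of Lucene, and the character set is taken from the list quoted above. It escapes "&" and "|" one character at a time, so "dog && cat" comes out as "dog \&\& cat"; a lone "&" gets a backslash too, which should be harmless. Lucene's classic QueryParser also has a static escape(String) method, which is probably the better choice in real code.

```java
// Hypothetical helper, not part of Lucene: escapes query-syntax characters
// in a single OCR token and appends the fuzzy operator.
final class FuzzyTermEscaper {

    // The individual characters that make up the special tokens quoted in
    // the thread: + - && || ! ( ) { } [ ] ^ " ~ * ? : \ /
    private static final String SPECIAL = "+-&|!(){}[]^\"~*?:\\/";

    // Puts a backslash in front of every special character in the text.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length() * 2);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Builds the string to hand to the query parser: the escaped token
    // followed by an unescaped trailing "~" that requests a fuzzy match.
    static String toFuzzyQueryString(String token) {
        return escape(token) + "~";
    }
}
```

With this, a token such as f(oo] would be passed to the parser as f\(oo\]~ instead of being rejected, and the garbage characters stay in the term rather than being removed. Empty tokens are not handled here and would need to be skipped by the caller.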