Message-ID: <707841175.1211872675761.JavaMail.jira@brutus>
Date: Tue, 27 May 2008 00:17:55 -0700 (PDT)
From: "Michael Busch (JIRA)"
To: java-dev@lucene.apache.org
Subject: [jira] Updated: (LUCENE-1189) QueryParser does not correctly handle escaped characters within quoted strings
In-Reply-To: <1485010573.1203934851122.JavaMail.jira@brutus>

     [ https://issues.apache.org/jira/browse/LUCENE-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-1189:
----------------------------------

    Attachment:
lucene-1189.patch

Thanks for your patch, Tomer! Your approach certainly seems correct to me. The file I'm attaching has your fix to QueryParser.jj and also a testcase similar to your example that fails before and passes after applying the patch.

I'm planning to commit this in a day or so.

> QueryParser does not correctly handle escaped characters within quoted strings
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-1189
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1189
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: QueryParser
>    Affects Versions: 2.2, 2.3, 2.3.1
>         Environment: Windows Vista Business (x86 and x64) as well as latest Ubuntu server, both cases under Tomcat 6.0.14.
> This shouldn't matter though.
>            Reporter: Tomer Gabel
>            Assignee: Michael Busch
>         Attachments: lucene-1189.patch, QueryParser.jj.patch
>
>
> The Lucene query parser incorrectly handles escaped characters inside quoted strings; specifically, a quoted string that ends with an (escaped) backslash followed by any additional quoted string will not be properly tokenized. Consider the following example:
> bq. {{(name:"///mike\\\\\\") or (name:"alphonse")}}
> This is not a contrived example -- it derives from an actual bug we've encountered in our system. Running this query will throw an exception, but removing the second clause resolves the problem. After some digging I've found that the problem is with the way quoted strings are processed by the lexer: you'll notice that Mike's name is followed by three escaped backslashes right before the ending quote; looking at the JavaCC code for the query parser highlights the problem:
> {code:title=QueryParser.jj|borderStyle=solid}
> TOKEN : {
>
> |
> |
> |
> |
> |
> |
> |
> |
> | : Boost
> |
> ...
> {code}
> Take a look at the way the QUOTED token is constructed -- there is no lexical processing of the escaped characters within the quoted string itself.
> In the above query the lexer matches everything from the first quote through all the backslashes, _treating the end quote as an escaped character_, thus also matching the starting quote of the second term. This causes a lexer error, because the last quote is then considered the start of a new match.
> I've come to understand that the Lucene query handler is supposed to be able to handle unsanitized human input; indeed the lexer above would handle a query like {{"blah\"}} without complaining, but that's a "best-guess" approach that results in bugs with legal, automatically generated queries. I've attached a patch that fixes the erroneous behavior but does not maintain leniency with malformed queries; I believe this is the correct approach because the two design goals are fundamentally at odds. I'd appreciate any comments.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
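
The tokenization problem described in the issue can be illustrated outside of JavaCC. The sketch below is only an approximation: it uses java.util.regex as a stand-in for the generated lexer, and the two patterns (NAIVE and ESCAPE_AWARE) are hypothetical rules written for this illustration, not the actual QUOTED definition from QueryParser.jj and not the contents of either attached patch. NAIVE knows about an escaped quote but not about an escaped backslash; ESCAPE_AWARE consumes every backslash together with the character that follows it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedTokenDemo {
    // Hypothetical rule: a quote, then escaped quotes or any non-quote
    // characters, then a quote. The escaped-quote alternative is listed first
    // so the regex engine prefers it, roughly imitating a longest-match lexer.
    static final Pattern NAIVE = Pattern.compile("\"(?:\\\\\"|[^\"])+\"");

    // Hypothetical escape-aware rule: a backslash always pairs with the next
    // character, so an escaped backslash cannot also escape the closing quote.
    static final Pattern ESCAPE_AWARE = Pattern.compile("\"(?:[^\"\\\\]|\\\\.)*\"");

    // The query from the issue: six backslashes (three escaped ones) before
    // the closing quote of the first clause.
    static String sampleQuery() {
        String bs = "\\"; // a single backslash character
        return "(name:\"///mike" + bs + bs + bs + bs + bs + bs
                + "\") or (name:\"alphonse\")";
    }

    // Collects every substring the given rule recognizes as a quoted token.
    static List<String> quotedTokens(Pattern rule, String query) {
        List<String> tokens = new ArrayList<>();
        Matcher m = rule.matcher(query);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // True if the whole input is accepted as one quoted token.
    static boolean isQuotedToken(Pattern rule, String input) {
        return rule.matcher(input).matches();
    }

    public static void main(String[] args) {
        String query = sampleQuery();
        // The naive rule reads the last backslash plus the closing quote as an
        // escaped quote and runs on into the second clause.
        System.out.println("naive:        " + quotedTokens(NAIVE, query));
        // The escape-aware rule stops at the real closing quote.
        System.out.println("escape-aware: " + quotedTokens(ESCAPE_AWARE, query));
        // Leniency trade-off: the malformed input "blah\" (backslash before
        // the final quote) is accepted by the naive rule only.
        String malformed = "\"blah\\\"";
        System.out.println("naive accepts malformed:        " + isQuotedToken(NAIVE, malformed));
        System.out.println("escape-aware accepts malformed: " + isQuotedToken(ESCAPE_AWARE, malformed));
    }
}
```

Under these assumptions the naive rule yields a single over-long token for the sample query and leaves the final quote unmatched, while the escape-aware rule yields the two expected quoted strings but rejects the malformed "blah\" input. This mirrors the trade-off between correctness and leniency raised in the issue.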