lucene-dev mailing list archives

From "Tomer Gabel (JIRA)" <>
Subject [jira] Updated: (LUCENE-1189) QueryParser does not correctly handle escaped characters within quoted strings
Date Mon, 25 Feb 2008 10:22:51 GMT


Tomer Gabel updated LUCENE-1189:

    Attachment: QueryParser.jj.patch

The patch to correct the query parser behavior.

> QueryParser does not correctly handle escaped characters within quoted strings
> ------------------------------------------------------------------------------
>                 Key: LUCENE-1189
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: QueryParser
>    Affects Versions: 2.2, 2.3, 2.3.1
>         Environment: Windows Vista Business (x86 and x64) as well as latest Ubuntu server, both cases under Tomcat 6.0.14.
> This shouldn't matter though.
>            Reporter: Tomer Gabel
>         Attachments: QueryParser.jj.patch
> The Lucene query parser incorrectly handles escaped characters inside quoted strings. Specifically, when a quoted string ends with an escaped backslash and is followed by another quoted string, the query is not tokenized correctly. Consider the following example:
> bq. {{(name:"///mike\\\\\\") or (name:"alphonse")}}
> This is not a contrived example; it derives from an actual bug we encountered in our system. Running this query throws an exception, but removing the second clause resolves the problem. After some digging I found that the problem lies in the way the lexer processes quoted strings. Notice that Mike's name is followed by three escaped backslashes (six backslash characters) right before the closing quote. The JavaCC code for the query parser highlights the problem:
> {code:title=QueryParser.jj|borderStyle=solid}
>   <AND:       ("AND" | "&&") >
> | <OR:        ("OR" | "||") >
> | <NOT:       ("NOT" | "!") >
> | <PLUS:      "+" >
> | <MINUS:     "-" >
> | <LPAREN:    "(" >
> | <RPAREN:    ")" >
> | <COLON:     ":" >
> | <STAR:      "*" >
> | <CARAT:     "^" > : Boost
> | <QUOTED:     "\"" (~["\""] | "\\\"")* "\"">
> ...
> {code}
> Look at how the QUOTED token is constructed: the lexer does no processing of escaped characters within the quoted string itself. In the query above, the lexer matches everything from the first quote through all the backslashes, _treating the closing quote as an escaped character_ (the last backslash and the quote together match the escaped-quote alternative), and therefore keeps going until it consumes the opening quote of the second term. This causes a lexer error, because the final quote is then treated as the start of a new quoted string that never terminates.
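
The tokenization described above can be reproduced outside JavaCC. The following is a minimal sketch, assuming a java.util.regex pattern is an acceptable stand-in for the QUOTED rule; the class and method names are hypothetical, and the alternation order is chosen only to imitate JavaCC's longest-match behavior for this particular input (Java 11+ for String.repeat).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedLexerDemo {
    // Regex stand-in for the rule:  "\"" (~["\""] | "\\\"")* "\""
    // The escaped-quote alternative is listed first so the backtracking
    // regex engine prefers it, which imitates JavaCC's longest match for
    // this input (an approximation, not a general equivalence).
    static final Pattern QUOTED = Pattern.compile("\"(?:\\\\\"|[^\"])*\"");

    // The query from the report: six backslash characters before the quote.
    static final String QUERY =
        "(name:\"///mike" + "\\".repeat(6) + "\") or (name:\"alphonse\")";

    // Collects every substring the QUOTED pattern matches, left to right.
    static List<String> quotedTokens(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : quotedTokens(QUERY)) {
            System.out.println(token);
        }
    }
}
```

Under these assumptions the pattern finds a single token that runs from the first quote to the opening quote of the second term. The text left over, `alphonse")`, contains one unmatched quote, which corresponds to the lexer error.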
> I understand that the Lucene query handler is supposed to handle unsanitized human input; indeed, the lexer above accepts a query like {{"blah\"}} without complaint. However, that is a "best-guess" approach that results in bugs with legal, automatically generated queries. I've attached a patch that fixes the erroneous behavior but does not maintain leniency toward malformed queries; I believe this is the correct approach because the two design goals are fundamentally at odds. I'd appreciate any comments.
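
The trade-off between correctness and leniency can be illustrated with a stricter quoted-string rule. This is a sketch only: it is not taken from the attached patch, the rule shown is a hypothetical one in which a backslash always escapes the character after it, and the class and method names are made up for the example (Java 11+ for String.repeat).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StrictQuotedDemo {
    // Hypothetical stricter rule, roughly:
    //   "\"" ( ~["\"", "\\"] | "\\" ~[] )* "\""
    // A backslash always consumes the next character, so an escaped
    // backslash can no longer be misread as escaping the closing quote.
    static final Pattern STRICT =
        Pattern.compile("\"(?:[^\"\\\\]|\\\\.)*\"", Pattern.DOTALL);

    // Collects every substring the STRICT pattern matches, left to right.
    static List<String> quotedTokens(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = STRICT.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // True only if the whole input is one well-formed quoted string.
    static boolean isQuotedToken(String input) {
        return STRICT.matcher(input).matches();
    }

    public static void main(String[] args) {
        String query =
            "(name:\"///mike" + "\\".repeat(6) + "\") or (name:\"alphonse\")";
        System.out.println(quotedTokens(query));         // two separate tokens
        System.out.println(isQuotedToken("\"blah\\\"")); // "blah\" is rejected
    }
}
```

With this rule the reported query yields two quoted tokens, one per clause, while the malformed input `"blah\"` is no longer accepted as a quoted string, which is the loss of leniency the description mentions.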

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
