lucene-dev mailing list archives

From "Jack Krupansky (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-4382) Unicode escape no longer works for non-prefix wildcard terms
Date Thu, 13 Sep 2012 00:11:09 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-4382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jack Krupansky updated LUCENE-4382:
-----------------------------------

    Description: 
LUCENE-588 added support for escaping of wildcard characters, but when the de-escaping logic
was pushed down from the query parser (QueryParserBase) into WildcardQuery, support for Unicode
escaping (backslash, "u", and the four-digit hex Unicode code) was not included.

Two solutions:

1. Do the Unicode de-escaping in the query parser before calling getWildcardQuery.
2. Support Unicode de-escaping in WildcardQuery.

A trailing (suffix) wildcard, which is handled as a prefix query, does not exhibit this problem
because the query parser performs full de-escaping before calling getPrefixQuery.
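
For illustration, solution 1 could look roughly like the sketch below. It is not the actual
QueryParserBase code: the class and method names (UnicodeEscapes, decodeUnicodeEscapes) are
hypothetical, and it assumes the parser would run this over the raw term text before calling
getWildcardQuery. Only the backslash-u escapes are decoded; all other backslash escapes are
passed through so that WildcardQuery can still de-escape wildcard characters itself. If a
decoded character is itself '*', '?' or a backslash, the sketch re-escapes it so that it does
not turn into a live wildcard (also an assumption about the desired behavior).

```java
// Hypothetical helper, not part of Lucene: decodes only backslash-u-XXXX escapes.
final class UnicodeEscapes {
    private UnicodeEscapes() {}

    static String decodeUnicodeEscapes(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '\\' && i + 1 < input.length()) {
                char next = input.charAt(i + 1);
                if (next == 'u') {
                    // Expect exactly four hex digits after the 'u'.
                    if (i + 6 > input.length()) {
                        throw new IllegalArgumentException("Truncated Unicode escape at offset " + i);
                    }
                    int code = 0;
                    for (int k = i + 2; k < i + 6; k++) {
                        int digit = Character.digit(input.charAt(k), 16);
                        if (digit < 0) {
                            throw new IllegalArgumentException("Non-hex digit in Unicode escape at offset " + k);
                        }
                        code = code * 16 + digit;
                    }
                    char decoded = (char) code;
                    // Keep decoded wildcard/escape characters literal for the later de-escaping step.
                    if (decoded == '*' || decoded == '?' || decoded == '\\') {
                        out.append('\\');
                    }
                    out.append(decoded);
                    i += 6;
                } else {
                    // Any other escape (for example an escaped colon or star) is copied unchanged.
                    out.append(c).append(next);
                    i += 2;
                }
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

With this helper, the term text from the test case below (backslash-colon, then the
backslash-u escape for "c") would reach WildcardQuery with the escaped colon intact and a
plain "c" in place of the Unicode escape.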

My test case, added at the beginning of TestExtendedDismaxParser.testFocusQueryParser:

{code}

    assertQ("expected doc is missing (using escaped edismax w/field)",
        req("q", "t_special:literal\\:\\u0063olo*n", 
            "defType", "edismax"),
        "//doc[1]/str[@name='id'][.='46']"); 

{code}

Note: I used that test case only to step into WildcardQuery in a debugger and confirm that the
Unicode escape is not processed correctly. The test fails in all cases, but that is because of
how the field type is analyzed, not because of this bug.

Here is a Lucene-level test case that can also be debugged to see that WildcardQuery is not
processing the Unicode escape properly. I added it at the start of TestMultiAnalyzer.testMultiAnalyzer:

{code}
    assertEquals("literal\\:\\u0063olo*n", qp.parse("literal\\:\\u0063olo*n").toString());
{code}

Note: This assertion always passes because it checks only the input pattern string given to
WildcardQuery, not how the de-escaping is performed within WildcardQuery.


  was:
LUCENE-588 added support for escaping of wildcard characters, but when the de-escaping logic
was pushed down from the query parser (QueryParserBase) into WildcardQuery, support for Unicode
escaping (backslash, "u", and the four-digit hex Unicode code) was not included.

Two solutions:

1. Do the Unicode de-escaping in the query parser before calling getWildcardQuery.
2. Support Unicode de-escaping in WildcardQuery.

A suffix wildcard does not exhibit this problem because full de-escaping is performed in the
query parser before calling getPrefixQuery.

My test case, added at the beginning of TestExtendedDismaxParser.testFocusQueryParser:

{code}

    assertQ("expected doc is missing (using escaped edismax w/field)",
        req("q", "t_special:literal\\:\\u0063olo*n", 
            "defType", "edismax"),
        "//doc[1]/str[@name='id'][.='46']"); 

{code}


    
> Unicode escape no longer works for non-prefix wildcard terms
> ------------------------------------------------------------
>
>                 Key: LUCENE-4382
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4382
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/queryparser
>    Affects Versions: 4.0-BETA
>            Reporter: Jack Krupansky
>             Fix For: 4.0
>
>
> LUCENE-588 added support for escaping of wildcard characters, but when the de-escaping
> logic was pushed down from the query parser (QueryParserBase) into WildcardQuery, support
> for Unicode escaping (backslash, "u", and the four-digit hex Unicode code) was not included.
> Two solutions:
> 1. Do the Unicode de-escaping in the query parser before calling getWildcardQuery.
> 2. Support Unicode de-escaping in WildcardQuery.
> A trailing (suffix) wildcard, which is handled as a prefix query, does not exhibit this
> problem because the query parser performs full de-escaping before calling getPrefixQuery.
> My test case, added at the beginning of TestExtendedDismaxParser.testFocusQueryParser:
> {code}
>     assertQ("expected doc is missing (using escaped edismax w/field)",
>         req("q", "t_special:literal\\:\\u0063olo*n",
>             "defType", "edismax"),
>         "//doc[1]/str[@name='id'][.='46']");
> {code}
> Note: I used that test case only to step into WildcardQuery in a debugger and confirm that
> the Unicode escape is not processed correctly. The test fails in all cases, but that is
> because of how the field type is analyzed, not because of this bug.
> Here is a Lucene-level test case that can also be debugged to see that WildcardQuery
> is not processing the Unicode escape properly. I added it at the start of TestMultiAnalyzer.testMultiAnalyzer:
> {code}
>     assertEquals("literal\\:\\u0063olo*n", qp.parse("literal\\:\\u0063olo*n").toString());
> {code}
> Note: This assertion always passes because it checks only the input pattern string given
> to WildcardQuery, not how the de-escaping is performed within WildcardQuery.

