jackrabbit-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alexander Klimetschek <aklim...@day.com>
Subject Re: Double String escaping - is it needed?
Date Tue, 10 Aug 2010 22:45:40 GMT
On Tue, Aug 10, 2010 at 16:56, Benjamin Brown
<benjamin.brown@thisisnumero.com> wrote:
> Following on...
>
> My (possibly mis)understanding is that Text.escapeIllegalXpathSearchChars(String s) is
incomplete and should be covering all the characters in http://lucene.apache.org/java/2_4_0/queryparsersyntax.html#Escaping%20Special%20Characters

Generally the escaping is necessary to differentiate between actually
looking for that literal including a special character (which mostly
isn't available in the full-text index anyway) or having those special
characters be used as part of the lucene search syntax.

Text.escapeIllegalXpathSearchChars() only checks for special chars at
the end of the string, which make the query invalid (these are
currently: ! ( : ^ [ ] { } ? ), nothing more. It is just a helper that
doesn't cover everything, as that also depends on the Lucene version
etc.

Also, this depends on how much of the Lucene fulltext search features
you want to pass through to the end-users search box. Eg. supporting
grouping of terms with double-quotes is quite common and useful. In
that respect I recently came across the issue of an unequal number of
double-quotes in the string (using single-quotes to place the literal
into the xpath) that leads to invalid query exceptions, so I had to
build a custom helper:

    public static String getFulltextStringLiteral(String value) {
        // double-quotes are used to build search term groups
        // if an uneven number occurs (eg. one), Jackrabbit's Lucene
search parser chokes,
        // thus we have to escape the last double-quote by putting a
backslash in front
        int doubleQuotes = count(value, '"');
        if (doubleQuotes % 2 == 1) {
            int last = value.lastIndexOf('"');
            value = value.substring(0, last) + "\\" + value.substring(last);
        }
        // escape special full text search chars (at the end of the string)
        value = Text.escapeIllegalXpathSearchChars(value);
        // put in single-quotes and double escape them inside the string
        return "'" + value.replaceAll("'", "''") + "'";
    }

Not sure if you can really come up with a one-size fits all solution...

> Since escapeIllegalXpathSearchChars "[is a] method you should use for calls to jcr:contains(...)"
and currently does NOT actually contain escaping for XPath's invalid single quote ('') - I
believe the intention of the method was mainly just to prevent Lucene issues.

Yes.

> Thus am I correct in that really all that needs to be done it to replace the existing
escapeIllegalXpathSearchChars with:
>
> public static String escapeIllegalXpathSearchChars(String s) {
>
>      // Escape Lucene special chars & XPath illegal chars
>        s = org.apache.lucene.queryParser.QueryParser.escape(s);
>
>      // Escape XPath illegal single quotes
>      return s.replaceAll("'", "''");
> }

jackrabbit-jcr-commons shouldn't have a dependency on lucene, so it
would make sense to copy the method instead. And maybe one (or more
;-)) new methods make more sense instead of changing the existing one,
in order to keep backwards-compatibility.

> I think this would prevent both XPath and Lucene issues as the name of the method suggests
(for as long as the Lucene special characters are also illegal XPath characters) or do I actually
need to escape twice, once for jcr's xpath, once for lucene as per my last posting....?

The xpath-level escaping is only for the single-quote within the
jcr:contains query string, if that one is defined with single-quotes
(I think you can also use double-quotes for the literal, but that
turns out to make things more complicated, iirc).

Regards,
Alex

-- 
Alexander Klimetschek
alexander.klimetschek@day.com

Mime
View raw message