lucene-solr-user mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: Is semicolon a character that needs escaping?
Date Thu, 02 Sep 2010 22:57:24 GMT

On Sep 2, 2010, at 12:35pm, Michael Lackhoff wrote:

> According to http://lucene.apache.org/java/2_9_1/queryparsersyntax.html
> only these characters need escaping:
> + - && || ! ( ) { } [ ] ^ " ~ * ? : \
> but with this simple query:
> TI:stroke; AND TI:journal
> I got the error message:
> HTTP ERROR: 400
> Unknown sort order: TI:journal
>
> My first guess was that it was a URL encoding issue, but everything
> looks fine:
> http://localhost:8983/solr/select/?q=TI%3Astroke%3B+AND+TI%3Ajournal&version=2.2&start=0&rows=10&indent=on
> as you can see, the semicolon is encoded as %3B
> There is no problem when the query ends with the semicolon:
> TI:stroke;
> gives no error.
> The first query also works if I escape the semicolon:
> TI:stroke\; AND TI:journal
>
> From this I conclude that there is a bug either in the docs or in the
> query parser or I missed something. What is wrong here?

The docs need to be updated, I believe. From some code I wrote back in
2006...

         // Also note that we escape ';', as Solr uses this to support embedding
         // commands into the query string (yikes), and the code base we're using
         // has a bug where if the ';' doesn't have two tokens after it
         // (whitespace separated) then you get an array index out of bounds error.
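
To illustrate why the error message talks about a sort order, here is a toy Python model of a "query;sort" convention. It is only a sketch built from the comment above and the error text in the question, not Solr's actual code; the function name `split_query_and_sort` and the exact rules are assumptions made for illustration.

```python
import re

def split_query_and_sort(q):
    """Toy model of a 'query;sort' convention (hypothetical, not Solr's code)."""
    # Split at the first ';' that is not preceded by a backslash.
    parts = re.split(r"(?<!\\);", q, maxsplit=1)
    query = parts[0].strip()
    sort_spec = parts[1].strip() if len(parts) > 1 else ""
    if not sort_spec:
        # Nothing after the ';' (or no ';' at all): no sort requested.
        return query, None
    tokens = sort_spec.split()
    if len(tokens) < 2:
        raise ValueError("Incomplete sort spec: " + sort_spec)
    # Assumed shape of a sort spec: "<field> <asc|desc>".
    field, order = tokens[0], tokens[1]
    if order.lower() not in ("asc", "desc"):
        raise ValueError("Unknown sort order: " + order)
    return query, (field, order.lower())
```

Under this model, `TI:stroke; AND TI:journal` is read as the query `TI:stroke` followed by a sort on a field named `AND` with order `TI:journal`, which fails. A trailing `;` or an escaped `\;` never reaches the sort parsing, which matches the behaviour described in the question.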

I also had this note, no idea if it's still an issue:

         // Before we do regular escaping, work around a bug in the Lucene query
         // parser. If the last character is a '\', we can escape it as '\\', but
         // if we build an expression that looks like xxx AND (<querytext\>) then
         // the Lucene query parser will treat the final '\' before the ')' as
         // a signal to escape the ')' character. That's just wrong, but for now
         // we'll just strip off any trailing '\' characters in the clause.
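
The two notes above could be combined into something like the following Python sketch. The helper name `escape_clause` is hypothetical, and the character list is the one from the query parser syntax page quoted in the question plus ';'. It is meant for raw user text, not for a full query, since it also escapes ':'.

```python
import re

# Special characters from the Lucene query syntax page, plus ';'.
# '&&' and '||' are handled by escaping each '&' and '|' individually.
_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\;])')

def escape_clause(text):
    """Escape user text so it can be embedded in a query clause (sketch)."""
    # Strip trailing backslashes first, so the clause can never end in
    # a '\' that would escape a following ')'.
    text = text.rstrip("\\")
    # Prefix every remaining special character with a backslash.
    return _SPECIAL.sub(r"\\\1", text)
```

For the query in the question, the field prefix would be added outside the helper, for example `"TI:" + escape_clause("stroke;") + " AND TI:" + escape_clause("journal")`, which yields `TI:stroke\; AND TI:journal`.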

But in general, escaping characters in a query gets tricky. If you can
build queries directly instead of pre-processing text that is sent to the
query parser, you'll save yourself some pain and suffering.

Also, since I wrote the above code, the DisMaxRequestHandler has been
added to Solr, and it (IIRC) tries to be smart about handling this
type of escaping for you.
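
As a rough sketch of what a dismax request could look like, the Python below builds a URL that passes the raw user text in `q` and names the field to search in `qf`. The helper name `dismax_url` is hypothetical, the base URL and the `TI` field are taken from the question, and it assumes a Solr version that selects the parser with `defType=dismax`. Whether a given version then treats ';' as plain text would need to be checked.

```python
from urllib.parse import urlencode

def dismax_url(user_text, base="http://localhost:8983/solr/select/"):
    """Build a select URL that hands raw user text to dismax (sketch)."""
    params = {
        "defType": "dismax",  # assumes the install supports defType
        "qf": "TI",           # field(s) to search, instead of a TI: prefix
        "q": user_text,       # raw user text, no manual escaping
        "rows": 10,
    }
    # urlencode percent-encodes ';' as %3B and spaces as '+'.
    return base + "?" + urlencode(params)
```

This only handles URL encoding. Query-level escaping is left to the handler, which is the point of the suggestion above.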

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g





