lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Walt Stoneburner" <walt.stonebur...@gmail.com>
Subject Standard Analyzer Escapes
Date Fri, 13 Jul 2007 15:13:11 GMT
In reading the documentation for escape characters, I'm having a
little trouble understanding what it wants me to do for certain
special cases.

http://lucene.apache.org/java/docs/queryparsersyntax.html#Escaping%20Special%20Characters
says: "Lucene supports escaping special characters that are part of
the query syntax. The current list special characters are:   + - && ||
! ( ) { } [ ] ^ " ~ * ? : \     To escape these character use the \
before the character."

Specifically, I'm curious about the double characters && and || and
how they should be properly escaped.

Experimentation showed some very strange things with the StandardAnalyzer.

Using Luke, I get some interesting mappings.
  AT&T    becomes  at&t    (as expected)
  AT&&T  becomes  t   (tricky... at is now taken as a stop word; fine
makes sense)

..but what about...   "AT&&T"   ...nope, still t.

AAA&BBB becomes aaa&bbb    ...correct
AAA&&BBB becomes   aaa bbb   ...ampersand becomes a space?
"AAA&&BBB" is also    aaa bbb

AAA\&BBB correctly is   aaa&bbb   ...just as before
AAA\&&BBB   is  aaa bbb   ...but perhaps we got the escape wrong.

Is '&&' special "character" and is it escaped as \&& or escaped as
\&\& ...let's find out.

AAA\&\&BBB   is also  aaa bbb   ...perhaps we need quotes?
"AAA\&\&BBB"   is also  aaa bbb   ...I can't seem to get the escape to work.

How about this?
AAA&BBB&CCC    strangely becomes   aaa&bbb ccc

Even when escaped?
AAA\&BBB\&CCC  is also    aaa&bbb ccc    ...appears so.

What about...
AAA&BBB&CCC&DDD   becomes   aaa&bbb ccc&ddd  ....whoa, not expecting that.

AAA&&BBB&&CCC&&DDD  becomes   aaa bbb ccc ddd  ...if && means
AND, ok...

AAA\&&BBB\&&CCC\&&DDD   no change  aaa bbb ccc ddd

AAA\&\&BBB\&\&CCC\&\&DDD  also no change  aaa bbb ccc ddd


It appears I literally cannot search for the token with two ampersands
in it, whether they are touching or not.

Clearly I'm missing something.  Is there a way to get any literal
sequence of my choosing, using escapes, as a term in the Lucene
expression?

-Walt Stoneburner

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message