lucene-java-user mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: how to fully preprocess query before fuzzy search?
Date Mon, 17 Sep 2012 16:39:41 GMT
Either is fine. In fact, just escape based on the individual character, not 
the context. The multi-character context tells you where escaping is not 
essential, but escaping there anyway doesn't hurt.
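For illustration, here is a minimal Java sketch of that per-character rule. The `escape` helper is hand-rolled for this example; Lucene's classic QueryParser ships a static `escape` method that takes the same per-character approach. The character list is taken from the query syntax docs quoted further down in this thread.

```java
public class EscapeSketch {
    // Characters from the query syntax list; '&' and '|' are escaped
    // singly, which also covers the "&&" and "||" operators.
    private static final String SPECIALS = "\\+-!():^[]\"{}~*?|&/";

    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (SPECIALS.indexOf(c) >= 0) sb.append('\\');
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("dog && cat")); // dog \&\& cat
        System.out.println(escape("dog & cat"));  // dog \& cat
    }
}
```

So "dog \&\& cat" is what per-character escaping produces, and the lone "&" gets a backslash too, even though it is harmless unescaped.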

-- Jack Krupansky

-----Original Message----- 
From: Ilya Zavorin
Sent: Monday, September 17, 2012 11:08 AM
To: java-user@lucene.apache.org
Subject: RE: how to fully preprocess query before fuzzy search?

Thanks. So I do not need to escape the "&" in

"dog & cat"

But I do need to escape the "&&" in

"dog && cat"

correct? And do I escape as "dog \&& cat" or as "dog \&\& cat"?


Ilya


-----Original Message-----
From: Jack Krupansky [mailto:jack@basetechnology.com]
Sent: Monday, September 17, 2012 10:55 AM
To: java-user@lucene.apache.org
Subject: Re: how to fully preprocess query before fuzzy search?

"
Lucene supports escaping special characters that are part of the query 
syntax. The current list of special characters is

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /
"

See:
http://lucene.apache.org/core/4_0_0-ALPHA/queryparser/org/apache/lucene/queryparser/classic/package-summary.html

So, maybe you should escape all the special characters and then append the 
fuzzy operator. Note: in 4.0 the fuzzy query is limited to an edit distance of 2.
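A rough sketch of that preprocessing in Java, under the assumption that you escape everything first and then append "~". The helper name `toFuzzyQuery` is hypothetical; with the real classic QueryParser you could use its static `escape` method in place of the hand-rolled loop, then hand the result to `parser.parse(...)`.

```java
public class FuzzyQueryPrep {
    // Escape every query-syntax special character, then append "~" so the
    // parser builds a fuzzy query (edit distance is capped at 2 in 4.0).
    static String toFuzzyQuery(String token) {
        StringBuilder sb = new StringBuilder();
        for (char c : token.toCharArray()) {
            if ("\\+-!():^[]\"{}~*?|&/".indexOf(c) >= 0) sb.append('\\');
            sb.append(c);
        }
        return sb.append('~').toString();
    }

    public static void main(String[] args) {
        // The result is what you would pass to the query parser.
        System.out.println(toFuzzyQuery("ca[t")); // ca\[t~
        System.out.println(toFuzzyQuery("cat"));  // cat~
    }
}
```

Escaping first means stray brackets and quotes from the OCR output can no longer break the parse; the trailing "~" is the only unescaped syntax left in the query string.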

-- Jack Krupansky

-----Original Message-----
From: Ilya Zavorin
Sent: Monday, September 17, 2012 10:41 AM
To: java-user@lucene.apache.org
Subject: how to fully preprocess query before fuzzy search?

I am processing a bunch of text coming out of OCR, i.e. machine-generated 
text that contains errors such as garbage characters attached to words, 
letters replaced with similar-looking characters (e.g. "I" with "1"), etc. 
The text is whitespace-tokenized, and I am trying to match each token 
against an index using a fuzzy match, so that small amounts of occasional 
garbage in the tokens do not prevent a match.

Right now I am preprocessing each query as follows:

//term = token
Query queryF = parser.Parse(term.Replace("~", "") + "~");

However, searcher.Search still throws "can't parse" exceptions for queries 
that contain brackets, quotes and other garbage characters.

So how should I fully preprocess a query to avoid these exceptions?

Looks like I just need to remove a certain set of characters, just as the 
tilde is removed above. What is the complete set of such characters? Do I 
need to do any other preprocessing?

Thanks,

Ilya Zavorin


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

