lucene-java-user mailing list archives

From Ilya Zavorin
Subject how to fully preprocess query before fuzzy search?
Date Mon, 17 Sep 2012 14:41:20 GMT
I am processing a large amount of OCR output, i.e. machine-generated text that contains
errors such as garbage characters attached to words and letters replaced with similar-looking
characters (e.g. "I" replaced with "1"). The text is whitespace-tokenized, and I am trying to match
each token against an index using a fuzzy match, so that small amounts of occasional garbage
in the tokens do not prevent a match.

Right now I am preprocessing each query as follows:

//term = token
Query queryF = parser.Parse(term.Replace("~", "") + "~");

However, searcher.Search still throws "can't parse" exceptions for queries that contain brackets,
quotes and other garbage characters.

So how should I fully preprocess a query to avoid these exceptions?

It looks like I just need to remove a certain set of characters, the same way the tilde is removed
above. What is the complete set of such characters? Is any other preprocessing needed?
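
To make the question concrete, here is a minimal sketch of the kind of preprocessing I have in mind, written in plain Java with no Lucene dependency. It escapes special characters with a backslash instead of removing them, then appends the fuzzy operator. The class FuzzyQueryPrep, its methods, and the character set in SPECIAL are all assumptions for illustration: the set is modeled on what I understand the static QueryParser.escape(String) method in the Java query parser to handle, and it may differ between versions.

```java
// Hypothetical helper, not part of Lucene: prepares one OCR token for a fuzzy query.
final class FuzzyQueryPrep {

    // Assumed set of query-syntax characters; verify against the query parser
    // documentation for the Lucene version in use.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    // Prefixes every assumed special character with a backslash.
    static String escape(String token) {
        StringBuilder sb = new StringBuilder(token.length() * 2);
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Returns the escaped token followed by "~", or null if the token is blank.
    static String toFuzzyQueryString(String token) {
        String trimmed = token.trim();
        if (trimmed.isEmpty()) {
            return null; // nothing to search for
        }
        return escape(trimmed) + "~";
    }
}
```

The resulting string would then be passed to the parser in place of term.Replace("~", "") + "~". Escaping keeps the garbage characters in the term so the fuzzy match can still absorb them, whereas removing them changes the token before matching. This sketch does not handle tokens that are themselves operator keywords.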


Ilya Zavorin
