lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Issues with escaping special characters
Date Thu, 14 May 2009 23:59:04 GMT
I suspect that StandardAnalyzer is breaking your stream up on the "odd"
characters. All that escaping them in the query does is ensure that the
parser doesn't interpret them as (in this case) the beginning of a group
and a MUST operator. So I claim the parser correctly feeds (Parenth+eses
to the analyzer, which then breaks it up into the tokens you indicated.
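
To make the two stages concrete, here is a minimal plain-Java sketch. It is
not Lucene code: escape() and tokenize() are hypothetical stand-ins for
QueryParser.escape and StandardAnalyzer, the list of special characters is
an assumption (check the one in your Lucene version), and the tokenizer is
much cruder than the real analyzer. It only illustrates that escaping
protects characters from the parser, while the analyzer still drops them.

```java
import java.util.ArrayList;
import java.util.List;

public class EscapeVsAnalysis {
    // Assumed set of query-syntax characters; the real list depends on the Lucene version.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    // Hypothetical stand-in for QueryParser.escape: prefix each special character with '\'.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Crude stand-in for an analyzer: split on anything that is not a letter or digit, lowercase the rest.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String title = "(Parenth+eses";
        // What the parser sees after escaping: the ( and + are no longer operators.
        System.out.println(escape(title));
        // What the analyzer emits once the parser hands it the literal text.
        System.out.println(tokenize(title));
    }
}
```

The second line of output is the pair of terms Ari saw in the PhraseQuery,
which is why escaping alone can't make the ( and + searchable.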

Assuming you indexed this exact string with StandardAnalyzer, if you
looked in your index (say with Luke), you'd see that "parenth" and
"eses" were the tokens indexed.

Warning: I haven't used the ngram tokenizers, so I know just enough to
be dangerous. That said, you could tokenize these as ngrams. I'm not sure
what the base ngram tokenizer does with your special characters, but you
could pretty easily create your own analyzer that spits out, say, 2-grams
(or whatever size) and use that to index and search, possibly using a
second field (or fields) for the data you want to treat this way...
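
As a rough illustration of the idea, the plain-Java sketch below produces
character 2-grams that keep the special characters. The class and method
names are made up for the example, lowercasing is my own choice, and this is
not Lucene's ngram tokenizer; a real solution would wrap equivalent logic in
a custom analyzer used for both indexing and searching.

```java
import java.util.ArrayList;
import java.util.List;

public class CharNGrams {
    // Hypothetical helper: every run of n consecutive characters, lowercased, nothing stripped.
    static List<String> ngrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        String lower = text.toLowerCase();
        for (int i = 0; i + n <= lower.length(); i++) {
            grams.add(lower.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        List<String> indexed = ngrams("(Parenth+eses", 2);
        List<String> partial = ngrams("(Parenth+e", 2);
        System.out.println(indexed);
        System.out.println(partial);
        // Every gram of the partial term also occurs in the indexed title.
        System.out.println(indexed.containsAll(partial));
    }
}
```

Because the ( and + survive inside the grams, a query built from the grams
of the user's term (in order, as a phrase) can match on them, at the cost of
a larger index for that field.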

HTH
Erick

On Thu, May 14, 2009 at 7:18 PM, Ari Miller <ari1974@gmail.com> wrote:

> Say I have a book title, literally:
>
> (Parenth+eses
>
> How would I do a search to find exactly that book title, given the presence
> of the ( and + ?  QueryParser.escape isn't working.
> I would expect to be able to search for (Parenth+eses  [exact match] or
> (Parenth+e  [partial match]
> I can use QueryParser.escape to escape out the user search term, but
> feeding that to QueryParser with a StandardAnalyzer doesn't return what
> I would expect.
>
> For example, (Parenth+eses --> QueryParser.escape --> \(Parenth\+eses, when
> parsed becomes:
> PhraseQuery:
>    Term:parenth
>    Term:eses
>
> Note that the escaped special characters seem to be turned into spaces, not
> used literally.
> Up to this point, even attempting to directly create an appropriate query
> (PrefixQuery, PhraseQuery, TermQuery, etc.), I've been unable to come up
> with a query that will match the text with special characters and only that
> text.
> My longer term goal is to be able to take a user search term, identify
> it as a literal term (nothing inside should be treated as lucene special
> characters), and do a PrefixQuery with that literal term.
>
> In case it matters, the field I'm searching on is indexed, tokenized, and
> stored.
>
> Potentially relevant existing JIRA issues:
> http://issues.apache.org/jira/browse/LUCENE-271
> http://issues.apache.org/jira/browse/LUCENE-588
>
> Thanks,
> Ari
>
