lucene-java-user mailing list archives

From "Renaud Waldura" <>
Subject RE: Wildcard query with untokenized punctuation (again)
Date Thu, 14 Jun 2007 17:29:24 GMT
Thanks guys, I like it! I'm already applying some regexps before query
parsing anyway, so it's just another pass.
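For reference, a comma-to-whitespace pass of the kind discussed here could look like the following. This is a minimal sketch, not Renaud's actual code; the `QueryPreprocessor` class name and its regex are illustrative. It uses only the JDK, so no Lucene dependency is needed to try it:

```java
// Minimal sketch of a pre-parse pass: a comma sandwiched between
// word characters is rewritten as whitespace, so "smith,ann*"
// reaches the QueryParser as two tokens.
public class QueryPreprocessor {
    public static String preprocess(String query) {
        // Lookarounds leave both neighbors in place, so chains
        // like "a,b,c" are handled in a single pass.
        return query.replaceAll("(?<=\\w),(?=\\w)", " ");
    }

    public static void main(String[] args) {
        System.out.println(preprocess("smith,ann*")); // prints "smith ann*"
    }
}
```

The result can then be handed to the stock QueryParser as usual.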

Now, I'm not sure how to do that without breaking another QP feature that I
kind of like: the query <<smith,ann>> is parsed to PhraseQuery("smith ann").
And that seems right, from a user standpoint.

In fact, considering this, I realize <<smith,ann*>> should be parsed to
MultiPhraseQuery("smith", "ann*"), not <<+smith +ann*>> as I said earlier.

Brrrr. Getting hairy. Any hope?


-----Original Message-----
From: Mark Miller [] 
Sent: Thursday, June 14, 2007 6:43 AM
Subject: Re: Wildcard query with untokenized punctuation (again)

Gotta agree with Erick: the simplest approach is just to preprocess the
query before sending it to the QueryParser.

My first thought is always to get out the sledgehammer...

- Mark

Erick Erickson wrote:
> Well, perhaps the simplest thing would be to pre-process the query and 
> make the comma into a whitespace before sending anything to the query 
> parser. I don't know how generalizable that sort of solution is in 
> your problem space though....
> Best
> Erick
> On 6/13/07, Renaud Waldura <> wrote:
>> My very simple analyzer produces tokens made of digits and/or letters 
>> only.
>> Anything else is discarded. E.g. the input "smith,anna" gets
>> tokenized as 2 tokens, first "smith" then "anna".
>> Say I have indexed documents that contained both "smith,anna" and 
>> "smith,annanicole". To find them, I enter the query <<smith,ann*>>. 
>> The stock Lucene 2.0 query parser produces a PrefixQuery for the 
>> single token "smith,ann". This token doesn't exist in my index, and I 
>> don't get a match.
>> I have found some references to this: 378386.html
>> but I don't understand how I can fix it. Comma-separated terms like 
>> this can appear in any field; I don't think I can create an 
>> untokenized field.
>> Really what I would like in this case is for the comma to be 
>> considered whitespace, and the query to be parsed to <<+smith 
>> +ann*>>. Any way I can do that?
>> --Renaud
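The analyzer behavior described above — tokens made of letters and digits only, everything else discarded — can be approximated outside Lucene with a plain split. This is a hedged stand-in for illustration; the `tokenize` helper is not Renaud's actual analyzer:

```java
import java.util.Arrays;

// Illustrative stand-in for the analyzer described above: a token is
// a maximal run of letters/digits; every other character separates tokens.
public class SimpleTokenizerSketch {
    public static String[] tokenize(String input) {
        // Split on any run of non-alphanumeric characters.
        return input.split("[^A-Za-z0-9]+");
    }

    public static void main(String[] args) {
        // prints [smith, anna]
        System.out.println(Arrays.toString(tokenize("smith,anna")));
    }
}
```

This makes the mismatch concrete: the index holds "smith" and "anna" as separate tokens, while the unmodified query parser builds a PrefixQuery on the single token "smith,ann", which matches nothing.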

To unsubscribe, e-mail:
For additional commands, e-mail:

