lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Eric Jain <Eric.J...@isb-sib.ch>
Subject Re: Preventing phrase queries from matching across lines
Date Sat, 29 Apr 2006 12:27:47 GMT
Erik Hatcher wrote:
> On Apr 28, 2006, at 5:35 AM, Eric Jain wrote:
>> What is the best way to prevent a phrase query such as "eggs white" 
>> matching "fried eggs\nwhite snow"?
>>
>> Two possibilities I have thought about:
>>
>> 1. Replace all line breaks with a special string, e.g. "newline".
>> 2. Have an analyzer somehow increment the position of a term for each 
>> line break it encounters.
>>
>> Latter seems a bit more complicated to implement, but it would also be 
>> more efficient, right? Or are there better options?
> 
> #2 shouldn't be too hard to implement, but you'll need to catch new 
> lines in the initial tokenizer.  I'm not sure about the efficiency, both 
> options would require a tokenizer detecting new lines and either 
> injecting a new term or setting a flag such that the next term gets a 
> position increment bump.

Thanks, #2 turned out to be easier to implement than expected. I should 
have precised that the "efficiency" I was concerned about was not the 
efficiency of the tokenization, but the impact of having all those 
additional "newline" term (positions) in the index.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message