incubator-lucy-dev mailing list archives

From Peter Karman
Subject Re: ProximityQuery
Date Mon, 22 Mar 2010 01:50:10 GMT
Marvin Humphrey wrote on 3/21/10 3:07 PM:
> On Sun, Mar 21, 2010 at 02:01:41AM -0500, Peter Karman wrote:
>> Marvin, please have a look when you have a chance, and let me know what needs
>> changing.
> The current implementation has a limitation I think is probably pretty
> important: 'b NEAR a' doesn't return the same result set as 'a NEAR b'.

As you noted earlier in this thread, there is no consensus about what a
proximity query is. :)

I did consider the fact that proximity might imply that order does not matter.
But I came down here: if I want order to matter, and the ProximityScorer ignores
order as you're suggesting, then I have no options. I can't limit my search to
'a NEAR b'.

If instead we leave the ProximityScorer as is, then this:

 (a NEAR b) OR (b NEAR a)

does what you're describing.
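
To make the point concrete, here is a minimal sketch in Python (not the actual
ProximityScorer code). It assumes each term's positions in one document are
given as a list of integers, and that 'a NEAR b' means some 'b' follows some
'a' by at most `proximity` positions, so proximity 1 is an exact phrase. The
function names are hypothetical.

```python
def near_ordered(first_positions, second_positions, proximity):
    """Order-sensitive match: True if some occurrence of the second term
    follows an occurrence of the first term by 1..proximity positions."""
    return any(
        0 < q - p <= proximity
        for p in first_positions
        for q in second_positions
    )


def near_either_order(a_positions, b_positions, proximity):
    """Order-insensitive match, built as (a NEAR b) OR (b NEAR a)."""
    return (near_ordered(a_positions, b_positions, proximity)
            or near_ordered(b_positions, a_positions, proximity))


# Toy document "b x a": 'b' is at position 0, 'a' is at position 2.
a_pos, b_pos = [2], [0]
ordered_hit = near_ordered(a_pos, b_pos, 3)          # 'a NEAR b' only
unordered_hit = near_either_order(a_pos, b_pos, 3)   # either order
```

With these inputs the ordered check fails, because 'b' comes before 'a', while
the ORed version matches. Keeping the scorer order-sensitive leaves both
behaviors available to the caller.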

Consider too:

 (a NEAR b NEAR c)

which might be written as:

 "a b c"~10

What order should I consider there? 'a' within 10 positions of 'b' and 'c'? or
'b' within 10 positions of 'a' and 'c'? or... You see how the possibilities

I think simpler is better here: if you want order to not matter, then OR
together the various orders you might be interested in. In fact, I may offer
that as an option in the Search::Query::Parser, which could then do the ORing
programmatically. Likewise, if we choose to support the "a b"~N syntax in the KS
QueryParser, we could do something similar.
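
A rough sketch of what that programmatic ORing could look like, again in
Python. This is not the Search::Query::Parser or KS QueryParser API; the
`expand_orders` function and the exact output syntax are assumptions made for
illustration only.

```python
from itertools import permutations


def expand_orders(terms, proximity):
    """Build one order-sensitive proximity clause per ordering of `terms`
    and OR them together, so the combined query ignores term order."""
    clauses = []
    for order in permutations(terms):
        phrase = " ".join(order)
        clauses.append(f'("{phrase}"~{proximity})')
    return " OR ".join(clauses)


two_terms = expand_orders(["a", "b"], 10)
three_terms = expand_orders(["a", "b", "c"], 10)
```

Note that the number of clauses grows as the factorial of the number of terms
(2 clauses for two terms, 6 for three), so this expansion is only practical
for short proximity queries.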

I note that one of the Lucene classes you mentioned earlier[0] makes inOrder an
option. The Lucene PhraseScorer's slop feature, however, does seem to respect
order with no option otherwise.


> Superficial stylistic suggestion: I might propose "proximity" (or "nearness",
> but "proximity" is better) instead of "near" for the name of that parameter.
> Or alternately, "slop", but I understand why you went with nearness instead.

I like 'proximity' for consistency's sake. And yes, 'near' is not quite right.
How about 'within'? Or 'vicinity'?

>> In the end it was a one-line difference in the SI_winnow_anchors implementation
>> to get the near/slop feature working. I left the original implementation intact
>> and put a switch/case wrapper around it to leave the optimization (if any)
>> intact for phrases (near==1).
> This doesn't technically need changing, but to cut down on the duplicated
> code, the switch on self->near should theoretically happen here:

ah yes, that's much better.

Peter Karman
