lucene-dev mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Searching in same position across multiple fields
Date Tue, 16 Dec 2008 20:41:38 GMT

: 1) Use a modified SpanNearQuery. If we assume that country + phone will always
: be one token, we can rely on the fact that the positions of 'au' and '5678' in
: Fred's document will be different.
: 
:    SpanQuery q1 = new SpanTermQuery(new Term("addresscountry", "au"));
:    SpanQuery q2 = new SpanTermQuery(new Term("addressphone", "5678"));
:    SpanQuery snq = new SpanNearQuery(new SpanQuery[]{q1, q2}, 0, false);
: 
: the slop of 0 means that we'll only return those where the two terms are in
: the same position in their respective fields. This works brilliantly, BUT
: requires a change to SpanNearQuery's constructor (which checks that all the
: clauses are against the same field). Are people amenable to perhaps adding
: another constructor to SNQ which doesn't do the check, or subclassing it to do
: the same (give it a protected non-checking constructor for the subclass to
: call)?

this has actually come up a couple of times over the years (i think Doug 
was the first person i ever heard suggest it) in the context of 
PhraseQuery ... the initial thought was that just removing the 
term1.field=term2.field assertion would allow something like this to work, 
but i don't think anyone ever tried creating a patch w/tests to verify 
it.

I think it would be a great idea.

: 2) It gets slightly more complicated in the case of variable-length terms. For
	...
: getPositionIncrementGap -- if we knew that 'address' would be, at most, 20
: tokens, we might use a position increment gap of 100, and make the slop factor
: 50; this works fine for the simple case (yay!), but with a great many
: addresses-per-user starts to get more complicated, as the gap counts from the
: last term (so the position sequence for a single value field might be 0, 100,
: 200, but for the address field it might be 0, 1, 2, 3, 103, 104, 105, 106,
: 206, 207... so it's going to get out of sync). The simplest option here seems

couldn't this be solved by an Analyzer that counts the tokens per fieldname 
and implements getPositionIncrementGap as...

	int result = SOME_BIG_NUM - tokensSeenMap.get(fieldname);
	tokensSeenMap.put(fieldname, 0);
	return result;

?
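
To make the arithmetic concrete, here is a minimal standalone sketch of that counting idea. It is plain Java with no Lucene dependency: SyncedGapSketch, tokenSeen and startPositions are hypothetical names made up for illustration, SOME_BIG_NUM is set to 100 arbitrarily, and the simulation assumes the gap is added to the position following the last token of the previous value (if the indexer counts from the last token itself, the constant shifts by one).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a Lucene class: models the per-field token counting idea.
class SyncedGapSketch {
    // Assumption: larger than the token count of any single field value.
    static final int SOME_BIG_NUM = 100;

    private final Map<String, Integer> tokensSeenMap = new HashMap<>();

    // Call once for every token emitted for the given field.
    void tokenSeen(String fieldname) {
        tokensSeenMap.merge(fieldname, 1, Integer::sum);
    }

    // Same logic as the snippet above: the gap shrinks by the number of
    // tokens seen since the last call, then the counter is reset.
    int positionIncrementGap(String fieldname) {
        int result = SOME_BIG_NUM - tokensSeenMap.getOrDefault(fieldname, 0);
        tokensSeenMap.put(fieldname, 0);
        return result;
    }

    // Simulates indexing several values of one field and returns the
    // position at which each value starts.
    List<Integer> startPositions(String fieldname, int[] tokensPerValue) {
        List<Integer> starts = new ArrayList<>();
        int position = 0; // next free position (assumed indexer behaviour)
        for (int i = 0; i < tokensPerValue.length; i++) {
            if (i > 0) {
                position += positionIncrementGap(fieldname);
            }
            starts.add(position);
            for (int t = 0; t < tokensPerValue[i]; t++) {
                tokenSeen(fieldname);
                position++;
            }
        }
        return starts;
    }
}
```

Under these assumptions, a single-token field and a field whose values have 4, 2 and 7 tokens both start their values at the same positions, so the fields stay in sync. A real Analyzer would also have to reset the counter between documents and cope with being shared across threads, which this sketch ignores.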


-Hoss

