lucene-dev mailing list archives

From Paul Cowan <co...@aconex.com>
Subject Searching in same position across multiple fields
Date Tue, 16 Dec 2008 03:14:20 GMT
Hi all,

(All examples below use Lucene 2.2; if things have changed in later 
versions please adjust accordingly, though a quick check of the classes 
involved shows no major changes in trunk.)

We have an interesting situation where we are effectively indexing two 
'entities' in our system, which share a one-to-many relationship 
(imagine 'User' and 'Delivery Address' for demonstration purposes). At 
the moment, we index one Lucene Document per 'many' end, duplicating the 
'one' end data, like so:

	userid: 1
	userfirstname: fred
	addresscountry: au
	addressphone: 1234

	userid: 1
	userfirstname: fred
	addresscountry: nz
	addressphone: 5678

	userid: 2
	userfirstname: mary
	addresscountry: au
	addressphone: 5678

(Note: two Documents are indexed for user 1.) This is somewhat annoying 
for us, because the results we want back from a Lucene search are 
(conceptually) at the 'user' level, so we have to collapse the hits by 
distinct user id, and so on (not to mention that it blows out the size 
of our index enormously). So why do we do it? It would make more sense 
to index one Document per user and add the address fields multiple 
times:

	userid: 1
	userfirstname: fred
	addresscountry: au
	addressphone: 1234
	addresscountry: nz
	addressphone: 5678

	userid: 2
	userfirstname: mary
	addresscountry: au
	addressphone: 5678

But imagine the search "+addresscountry:au +addressphone:5678". We'd 
like this to match ONLY Mary, but of course it matches Fred also because 
he matches both those terms (just for different addresses).
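
To make the false positive concrete, here is a minimal plain-Java model 
of the flattened documents above. It is not Lucene code: the class name 
FlattenedMatchDemo and its helpers are made up for illustration, and a 
document is simply modelled as a map from field name to the values added 
to that field, in order.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FlattenedMatchDemo {
    /** Builds a toy document: field name -> values in the order they were added. */
    static Map<String, List<String>> user(String name, String[] countries, String[] phones) {
        Map<String, List<String>> doc = new HashMap<>();
        doc.put("userfirstname", Arrays.asList(name));
        doc.put("addresscountry", Arrays.asList(countries));
        doc.put("addressphone", Arrays.asList(phones));
        return doc;
    }

    /** Mimics "+f1:v1 +f2:v2": each value only has to occur somewhere in its own field. */
    static boolean matchesBoth(Map<String, List<String>> doc,
                               String f1, String v1, String f2, String v2) {
        return doc.get(f1).contains(v1) && doc.get(f2).contains(v2);
    }

    public static void main(String[] args) {
        Map<String, List<String>> fred =
            user("fred", new String[] {"au", "nz"}, new String[] {"1234", "5678"});
        Map<String, List<String>> mary =
            user("mary", new String[] {"au"}, new String[] {"5678"});

        // Both print true: Fred matches even though no single address of his is au + 5678.
        System.out.println("fred: " + matchesBoth(fred, "addresscountry", "au", "addressphone", "5678"));
        System.out.println("mary: " + matchesBoth(mary, "addresscountry", "au", "addressphone", "5678"));
    }
}
```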

There are two aspects to the approach we've (more or less) got working, 
and I'd like to run them past the group to see whether they're worth 
trying to get into Lucene proper (if so, I'll create a JIRA issue for 
them).

1) Use a modified SpanNearQuery. If we assume that country and phone 
will each always be a single token, we can rely on the fact that the 
positions of 'au' and '5678' in Fred's document will be different.

    SpanQuery q1 = new SpanTermQuery(new Term("addresscountry", "au"));
    SpanQuery q2 = new SpanTermQuery(new Term("addressphone", "5678"));
    SpanQuery snq = new SpanNearQuery(new SpanQuery[]{q1, q2}, 0, false);

The slop of 0 means that we'll only return those documents where the 
two terms are in the same position in their respective fields. This 
works brilliantly, BUT requires a change to SpanNearQuery's constructor 
(which checks that all the clauses are against the same field). Are 
people amenable to perhaps adding another constructor to SpanNearQuery 
which doesn't do the check, or subclassing it to do the same (give it a 
protected non-checking constructor for the subclass to call)?
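
The matching rule that the slop-0 query is meant to enforce can be 
sketched in plain Java, independent of Lucene. This is only a model of 
the idea, not of SpanNearQuery's implementation: SamePositionDemo is a 
made-up name, and the list index stands in for the token position, 
which assumes exactly one token per value in both fields.

```java
import java.util.Arrays;
import java.util.List;

public class SamePositionDemo {
    /**
     * True if v1 occurs in field1 at the same index at which v2 occurs in field2.
     * The index plays the role of the token position (one token per value assumed).
     */
    static boolean matchesAtSamePosition(List<String> field1, String v1,
                                         List<String> field2, String v2) {
        int n = Math.min(field1.size(), field2.size());
        for (int pos = 0; pos < n; pos++) {
            if (field1.get(pos).equals(v1) && field2.get(pos).equals(v2)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Fred: (au, 1234) at position 0 and (nz, 5678) at position 1.
        List<String> fredCountries = Arrays.asList("au", "nz");
        List<String> fredPhones = Arrays.asList("1234", "5678");
        // Mary: (au, 5678) at position 0.
        List<String> maryCountries = Arrays.asList("au");
        List<String> maryPhones = Arrays.asList("5678");

        // Prints "fred: false" because 'au' is at 0 but '5678' is at 1.
        System.out.println("fred: " + matchesAtSamePosition(fredCountries, "au", fredPhones, "5678"));
        // Prints "mary: true" because both values sit at position 0.
        System.out.println("mary: " + matchesAtSamePosition(maryCountries, "au", maryPhones, "5678"));
    }
}
```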

2) It gets slightly more complicated in the case of variable-length 
field values. For example, imagine we had an 'address' field ('123 
Smith St'), which will result in 1 to n tokens; slop 0 in a 
SpanNearQuery won't work here, of course. One thing we've toyed with is 
the idea of using getPositionIncrementGap: if we knew that 'address' 
would be, at most, 20 tokens, we might use a position increment gap of 
100 and make the slop factor 50. This works fine for the simple case 
(yay!), but with a great many addresses per user it starts to get more 
complicated, as the gap counts from the last term. The position 
sequence for a single-value field might be 0, 100, 200, but for the 
address field it might be 0, 1, 2, 3, 103, 104, 105, 106, 206, 207... 
so the two fields drift out of sync. The simplest option here seems to 
be changing (or supplementing)
    public int getPositionIncrementGap(String fieldname)
to
    public int getPositionIncrementGap(String fieldname, int currentPos)
so that we can override it to round up to the nearest 100 (or whatever) 
based on currentPos. The default implementation could just delegate to 
the existing one-argument getPositionIncrementGap(fieldname).
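
The drift, and the effect of a gap that rounds up based on the current 
position, can be checked with a small stand-alone simulation. This is a 
sketch, not Lucene's indexing code: PositionGapDemo and its methods are 
invented for illustration, and the arithmetic follows the numbers given 
above (the gap is added to the position of the last token of the 
previous value), so off-by-one details in a real index may differ. It 
also assumes every value has at least one token and fewer tokens than 
the block size.

```java
import java.util.ArrayList;
import java.util.List;

public class PositionGapDemo {
    /** Start position of each value when a fixed gap is added to the last token's position. */
    static List<Integer> fixedGapStarts(int[] tokenCounts, int gap) {
        List<Integer> starts = new ArrayList<>();
        int lastPos = -1; // position of the last token seen so far
        for (int i = 0; i < tokenCounts.length; i++) {
            int start = (i == 0) ? 0 : lastPos + gap;
            starts.add(start);
            lastPos = start + tokenCounts[i] - 1;
        }
        return starts;
    }

    /** Position-aware gap: the distance from currentPos to the next multiple of block. */
    static int roundingGap(int currentPos, int block) {
        int nextBoundary = (currentPos / block + 1) * block;
        return nextBoundary - currentPos;
    }

    /** Start position of each value when the gap rounds up to the next block boundary. */
    static List<Integer> roundedStarts(int[] tokenCounts, int block) {
        List<Integer> starts = new ArrayList<>();
        int lastPos = -1;
        for (int i = 0; i < tokenCounts.length; i++) {
            int start = (i == 0) ? 0 : lastPos + roundingGap(lastPos, block);
            starts.add(start);
            lastPos = start + tokenCounts[i] - 1;
        }
        return starts;
    }

    public static void main(String[] args) {
        int[] singleToken = {1, 1, 1}; // e.g. three country values
        int[] addresses = {4, 4, 2};   // e.g. three addresses of 4, 4 and 2 tokens

        System.out.println(fixedGapStarts(singleToken, 100)); // [0, 100, 200]
        System.out.println(fixedGapStarts(addresses, 100));   // [0, 103, 206] (drifts)
        System.out.println(roundedStarts(singleToken, 100));  // [0, 100, 200]
        System.out.println(roundedStarts(addresses, 100));    // [0, 100, 200] (aligned)
    }
}
```

With the rounding gap, the n-th value of every field starts at n * 100, 
so a slop smaller than the block size keeps matches within the same 
address.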

What do people think? Is this ugly, or worth pursuing? Does anyone have 
any other, better ideas? I was curious as to whether Hibernate Search 
deals with this problem for many-to-one relationships. However, it's 
not clear from the documentation whether it actually DOES or not, so if 
anyone has insight into that, that would be great.

Thanks in advance,

Paul
