lucene-dev mailing list archives

From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-1494) Additional features for searching for value across multiple fields (many-to-one style)
Date Thu, 30 Apr 2009 23:43:31 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated LUCENE-1494:
-----------------------------

    Attachment: LUCENE-1494-masking.patch

Some things looked like they wouldn't work with the masking patch, so I wrote some test cases
to convince myself they were broken (and because new code should always have test cases).
In particular I was worried about the lack of equals/hashCode methods, and the broken rewrite
method.

One interesting thing I discovered was that the code worked in many cases even though rewrite
was always just returning the masked inner query. Digging into it, I realized the reason:
none of the other SpanQuery classes pay any attention to what their nested clauses
return when they recursively rewrite. So a SpanNearQuery, whose constructor freaks out if the
fields of all the clauses don't match, happily generates spans if one of those clauses returns
a completely different SpanQuery on rewrite.
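
To make that concrete, here is a toy model of the behaviour described above. None of these
classes are Lucene classes: ToyQuery, ToyTerm, ToyMask and ToyNear are hypothetical stand-ins,
and the only thing they are meant to show is that a field check done in the constructor is
never repeated if the composite's rewrite discards what its clauses return.

```java
// Hypothetical stand-ins for illustration only; these are NOT Lucene classes.
interface ToyQuery {
    String field();
    ToyQuery rewrite();
}

class ToyTerm implements ToyQuery {
    private final String field;
    ToyTerm(String field) { this.field = field; }
    public String field() { return field; }
    public ToyQuery rewrite() { return this; }
}

// Reports a different field than its inner query, like a masking wrapper would.
class ToyMask implements ToyQuery {
    private final ToyQuery inner;
    private final String maskedField;
    ToyMask(ToyQuery inner, String maskedField) {
        this.inner = inner;
        this.maskedField = maskedField;
    }
    public String field() { return maskedField; }
    // The "broken" rewrite from the description: it drops the mask entirely.
    public ToyQuery rewrite() { return inner; }
}

class ToyNear implements ToyQuery {
    private final ToyQuery[] clauses;
    ToyNear(ToyQuery... clauses) {
        // Field check happens once, at construction time.
        for (ToyQuery c : clauses) {
            if (!c.field().equals(clauses[0].field())) {
                throw new IllegalArgumentException("Clauses must have same field.");
            }
        }
        this.clauses = clauses;
    }
    public String field() { return clauses[0].field(); }
    public ToyQuery rewrite() {
        for (ToyQuery c : clauses) {
            c.rewrite(); // result is ignored, so the original clauses stay in place
        }
        return this;
    }
    String[] clauseFields() {
        String[] fields = new String[clauses.length];
        for (int i = 0; i < clauses.length; i++) {
            fields[i] = clauses[i].field();
        }
        return fields;
    }
}

class RewriteDemo {
    static String[] fieldsAfterRewrite() {
        ToyQuery country = new ToyTerm("addresscountry");
        ToyQuery phone = new ToyMask(new ToyTerm("addressphone"), "addresscountry");
        ToyNear rewritten = (ToyNear) new ToyNear(country, phone).rewrite();
        return rewritten.clauseFields();
    }
}
```

In this sketch the mask survives the rewrite only because ToyNear never looks at what
ToyMask.rewrite() returned, which is the accident that made the patch appear to work.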

I also removed the proxying of getBoost and setBoost. It was causing problems with some
stock testing framework code that expects q1.equals(q1.clone().setBoost(newBoost)) to be
false. (This was evaluating to true because it's a shallow clone, and setBoost was proxying
and modifying the original inner query's boost value.) This means that FieldMaskingSpanQuery
is now consistent with how other SpanQueries deal with boost (they ignore the boosts of their
nested clauses).
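
The clone problem is easy to reproduce outside Lucene. The sketch below uses made-up classes
(InnerQuery, ProxyingMask, OwnBoostMask are hypothetical names, not anything in the patch) and
assumes equals compares the inner query by identity plus the boost. It only illustrates why a
shallow clone plus a proxied setBoost makes the clone indistinguishable from the original.

```java
// Hypothetical stand-ins for illustration only; these are NOT Lucene classes.
class InnerQuery {
    float boost = 1.0f;
}

// Proxies boost to the wrapped query, as the earlier version of the patch did.
class ProxyingMask implements Cloneable {
    final InnerQuery inner;
    ProxyingMask(InnerQuery inner) { this.inner = inner; }
    float getBoost() { return inner.boost; }
    void setBoost(float b) { inner.boost = b; }
    @Override public ProxyingMask clone() {
        try {
            return (ProxyingMask) super.clone(); // shallow: clone shares 'inner'
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e);
        }
    }
    @Override public boolean equals(Object o) {
        if (!(o instanceof ProxyingMask)) return false;
        ProxyingMask other = (ProxyingMask) o;
        return inner == other.inner && Float.compare(getBoost(), other.getBoost()) == 0;
    }
    @Override public int hashCode() { return System.identityHashCode(inner); }
}

// Keeps its own boost and ignores the boost of the wrapped query.
class OwnBoostMask implements Cloneable {
    final InnerQuery inner;
    private float boost = 1.0f;
    OwnBoostMask(InnerQuery inner) { this.inner = inner; }
    float getBoost() { return boost; }
    void setBoost(float b) { boost = b; }
    @Override public OwnBoostMask clone() {
        try {
            return (OwnBoostMask) super.clone(); // 'boost' is copied by value
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e);
        }
    }
    @Override public boolean equals(Object o) {
        if (!(o instanceof OwnBoostMask)) return false;
        OwnBoostMask other = (OwnBoostMask) o;
        return inner == other.inner && Float.compare(boost, other.boost) == 0;
    }
    @Override public int hashCode() { return System.identityHashCode(inner); }
}

class BoostCloneDemo {
    // With proxying, changing the clone's boost also changes the original's.
    static boolean proxyingCloneStillEqual() {
        ProxyingMask q1 = new ProxyingMask(new InnerQuery());
        ProxyingMask q2 = q1.clone();
        q2.setBoost(2.0f);
        return q1.equals(q2);
    }
    // Without proxying, only the clone's boost changes.
    static boolean ownBoostCloneStillEqual() {
        OwnBoostMask q1 = new OwnBoostMask(new InnerQuery());
        OwnBoostMask q2 = q1.clone();
        q2.setBoost(2.0f);
        return q1.equals(q2);
    }
}
```

Here proxyingCloneStillEqual() comes out true (the unwanted result) and
ownBoostCloneStillEqual() comes out false, which is what the test framework expects.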

New patch (with tests) attached. I'd like to have some more tests before committing (spans
are deep voodoo, and we're doing funky stuff with spans, which is all the more reason to test
thoroughly).

> Additional features for searching for value across multiple fields (many-to-one style)
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1494
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1494
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>    Affects Versions: 2.4
>            Reporter: Paul Cowan
>            Priority: Minor
>         Attachments: LUCENE-1494-masking.patch, LUCENE-1494-masking.patch, LUCENE-1494-multifield.patch, LUCENE-1494-positionincrement.patch
>
>
> This issue is to cover the changes required to do a search across multiple fields with
> the same name in a fashion similar to a many-to-one database. Below is my post on java-dev
> on the topic, which details the changes we need:
> ---
> We have an interesting situation where we are effectively indexing two 'entities' in
> our system, which share a one-to-many relationship (imagine 'User' and 'Delivery Address'
> for demonstration purposes). At the moment, we index one Lucene Document per 'many' end, duplicating
> the 'one' end data, like so:
>     userid: 1
>     userfirstname: fred
>     addresscountry: au
>     addressphone: 1234
>     userid: 1
>     userfirstname: fred
>     addresscountry: nz
>     addressphone: 5678
>     userid: 2
>     userfirstname: mary
>     addresscountry: au
>     addressphone: 5678
> (note: 2 Documents indexed for user 1). This is somewhat annoying for us, because when
> we search in Lucene the results we want back (conceptually) are at the 'user' level, so we
> have to collapse the results by distinct user id, etc. etc (let alone that it blows out the
> size of our index enormously). So why do we do it? It would make more sense to use multiple
> fields:
>     userid: 1
>     userfirstname: fred
>     addresscountry: au
>     addressphone: 1234
>     addresscountry: nz
>     addressphone: 5678
>     userid: 2
>     userfirstname: mary
>     addresscountry: au
>     addressphone: 5678
> But imagine the search "+addresscountry:au +addressphone:5678". We'd like this to match
> ONLY Mary, but of course it matches Fred also because he matches both those terms (just for
> different addresses).
> There are two aspects to the approach we've (more or less) got working but I'd like to
> run them past the group and see if they're worth trying to get them into Lucene proper (if
> so, I'll create a JIRA issue for them)
> 1) Use a modified SpanNearQuery. If we assume that country + phone will always be one
> token, we can rely on the fact that the positions of 'au' and '5678' in Fred's document will
> be different.
>    SpanQuery q1 = new SpanTermQuery(new Term("addresscountry", "au"));
>    SpanQuery q2 = new SpanTermQuery(new Term("addressphone", "5678"));
>    SpanQuery snq = new SpanNearQuery(new SpanQuery[]{q1, q2}, 0, false);
> the slop of 0 means that we'll only return those where the two terms are in the same
> position in their respective fields. This works brilliantly, BUT requires a change to SpanNearQuery's
> constructor (which checks that all the clauses are against the same field). Are people amenable
> to perhaps adding another constructor to SNQ which doesn't do the check, or subclassing it
> to do the same (give it a protected non-checking constructor for the subclass to call)?
> 2) It gets slightly more complicated in the case of variable-length terms. For example,
> imagine if we had an 'address' field ('123 Smith St') which will result in (1 to n) tokens;
> slop 0 in a SpanNearQuery won't work here, of course. One thing we've toyed with is the idea
> of using getPositionIncrementGap -- if we knew that 'address' would be, at most, 20 tokens,
> we might use a position increment gap of 100, and make the slop factor 50; this works fine
> for the simple case (yay!), but with a great many addresses-per-user starts to get more complicated,
> as the gap counts from the last term (so the position sequence for a single value field might
> be 0, 100, 200, but for the address field it might be 0, 1, 2, 3, 103, 104, 105, 106, 206,
> 207... so it's going to get out of sync). The simplest option here seems to be changing (or
> supplementing)
>    public int getPositionIncrementGap(String fieldname)
> to
>    public int getPositionIncrementGap(String fieldname, int currentPos)
> so that we can override that to round up to the nearest 100 (or whatever) based on currentPos.
> The default implementation could just delegate to getPositionIncrementGap().
> ---
> Patches (x2) to follow shortly
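
For reference, the "round up to the nearest 100" idea from point 2 of the quoted description
can be sketched as plain arithmetic. This is not the attached patch: alignedGap and
startPositions are hypothetical helpers, the block size of 100 is just the number used in the
description, and the sketch assumes the next value's first token lands at lastPos + gap (the
exact off-by-one depends on how the indexer applies the gap, which is not modelled here). It
also assumes every value has at least one token and fewer tokens than the block size.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not Lucene code and not the attached patch.
class AlignedGapSketch {

    // Gap that moves the next value to the next multiple of blockSize after lastPos.
    static int alignedGap(int lastPos, int blockSize) {
        return blockSize - (lastPos % blockSize);
    }

    // Start position of each value in a multi-valued field, given how many
    // tokens each value produces (assumed: 1 <= tokens < blockSize).
    static List<Integer> startPositions(int[] tokensPerValue, int blockSize) {
        List<Integer> starts = new ArrayList<>();
        int start = 0;
        for (int tokens : tokensPerValue) {
            starts.add(start);
            int lastPos = start + tokens - 1;
            start = lastPos + alignedGap(lastPos, blockSize);
        }
        return starts;
    }
}
```

With a block size of 100, a single-token field ({1, 1, 1}) and a four-token address field
({4, 4, 4}) both start their values at 0, 100, 200, so the n-th value of each field stays in
the same block instead of drifting to 103, 206 and so on.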

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

