lucene-java-user mailing list archives

From Uncle <>
Subject Re: Matching on "owned" docs -- filter or query? Or sort?
Date Sun, 22 Jul 2012 17:33:05 GMT
Thanks for the reply. I thought of using boosting, for example "((userId:14 AND title:have)^10
OR (title:have))" or "((userId:14^10 AND title:have) OR (title:have))" or something like that.
However, there would still be duplicates: all 3 docs for "To Have and To Have Not" would
be included, whereas I would only want the one I own to be there. Boosting also means the
results have to be sorted by score, so I can't apply any other sort order (I would want to sort
the results secondarily by title, for example). I might be able to go this route, but it seems
like some combination of custom filtering and sorting would work better.
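
To make the ordering I am after concrete, here is a minimal sketch in plain Java. It is not Lucene code: `Book`, `OwnedFirstSort`, and `sorted` are hypothetical names, and the list of hits stands in for whatever the search returns. It only shows the comparison rule (owned docs first, then title), which is independent of score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class OwnedFirstSort {
    // Hypothetical in-memory stand-in for an indexed document; not a Lucene class.
    static final class Book {
        final int ownerId;
        final String title;
        Book(int ownerId, String title) { this.ownerId = ownerId; this.title = title; }
    }

    // Orders books owned by userId ahead of all others, then alphabetically by title.
    static Comparator<Book> ownedFirstThenTitle(int userId) {
        return Comparator
            .comparing((Book b) -> b.ownerId != userId) // false (owned) sorts before true
            .thenComparing(b -> b.title);
    }

    // Returns a sorted copy and leaves the input list untouched.
    static List<Book> sorted(List<Book> hits, int userId) {
        List<Book> copy = new ArrayList<>(hits);
        copy.sort(ownedFirstThenTitle(userId));
        return copy;
    }

    public static void main(String[] args) {
        List<Book> hits = Arrays.asList(
            new Book(13, "To Have and To Have Not"),
            new Book(18, "Have a Little Faith"),
            new Book(14, "To Have and To Have Not"));
        for (Book b : sorted(hits, 14)) {
            System.out.println(b.ownerId + "  " + b.title);
        }
    }
}
```

This still leaves the duplicate copy of "To Have and To Have Not" in the list, so it only covers the sorting half of the problem.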

I thought of somehow doing an empty query to fetch all docs, sorting them to put docs with
my userId first, and then running a DuplicateFilter on title with KM_USE_FIRST_OCCURRENCE.
This is the duplicate elimination behavior I want. Then I would do a text search on the
remainder. But this seems very expensive.
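
The three steps can be sketched in memory as follows. Again this is plain Java, not Lucene: `PreferOwnedDedup`, `Book`, `sample`, and `search` are hypothetical names, a stable sort plus "keep the first occurrence per title" stands in for the sort and the DuplicateFilter, and a case-insensitive substring test stands in for the real text search.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class PreferOwnedDedup {
    // Hypothetical in-memory stand-in for an indexed document; not a Lucene class.
    static final class Book {
        final int ownerId;
        final String title;
        Book(int ownerId, String title) { this.ownerId = ownerId; this.title = title; }
    }

    // The example data from the original question.
    static List<Book> sample() {
        return Arrays.asList(
            new Book(13, "To Have and To Have Not"), new Book(14, "To Have and To Have Not"),
            new Book(19, "To Have and To Have Not"), new Book(18, "Have a Little Faith"),
            new Book(15, "Snow Crash"), new Book(17, "Snow Crash"),
            new Book(18, "Cryptonomicon"), new Book(14, "Of Mice And Men"),
            new Book(17, "Flash Crash"));
    }

    static List<Book> search(List<Book> all, int userId, String term) {
        // Step 1: stable sort that moves docs owned by userId to the front.
        List<Book> ordered = new ArrayList<>(all);
        ordered.sort(Comparator.comparing((Book b) -> b.ownerId != userId));

        // Step 2: keep only the first occurrence of each title (the owned copy, if any).
        Map<String, Book> firstPerTitle = new LinkedHashMap<>();
        for (Book b : ordered) {
            firstPerTitle.putIfAbsent(b.title, b);
        }

        // Step 3: naive stand-in for the text search, run on the remainder.
        String needle = term.toLowerCase(Locale.ROOT);
        List<Book> hits = new ArrayList<>();
        for (Book b : firstPerTitle.values()) {
            if (b.title.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(b);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        for (Book b : search(sample(), 14, "have")) {
            System.out.println(b.ownerId + "  " + b.title);
        }
    }
}
```

The sketch touches every doc and holds every distinct title in the map before the search term is even looked at, which is exactly the cost I am worried about.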


On Jul 22, 2012, at 11:33 AM, Erick Erickson wrote:

> Hmmm, what about simply boosting very high on owner, and probably
> grouping on title?
> If you boosted on owner, you wouldn't even have to index the title
> separately for each user, your "owner" field could be multivalued and
> contain _all_ the owner IDs. In that case you wouldn't have to group
> at all......
> Best
> Erick
> On Sun, Jul 22, 2012 at 11:06 AM, Uncle <> wrote:
>> I also posted this to StackOverflow, apologies if you see this twice.
>> I have a data set whereby documents are associated with a user id. Say that the documents
>> represent books, and each book can have one or more owners. I am indexing the titles with
>> Lucene. When searching, I want all results owned by me to be sorted at the top of the
>> results, before results that are not owned by me. So the data might look like:
>> Owner ID   Book Title
>> --------   ----------
>> 13         To Have and To Have Not
>> 14         To Have and To Have Not
>> 19         To Have and To Have Not
>> 18         Have a Little Faith
>> 15         Snow Crash
>> 17         Snow Crash
>> 18         Cryptonomicon
>> 14         Of Mice And Men
>> 17         Flash Crash
>> Say that my user id is 14 and I search on "have", I want to match on both "To Have
>> and To Have Not" and "Have a Little Faith", but "To Have and To Have Not" should show up
>> higher in my search results, because I own it. Similarly, if I am user id 15 and search for
>> "Crash", I will match both "Snow Crash" and "Flash Crash", but "Snow Crash" should show up
>> first because I own it. If I am user id 14 and I search for "crash", I would still get a
>> match for "Snow Crash" even though I don't own it. If I did a fuzzy match for "a" which would
>> match almost all of these titles, I would see those that I own before I see the others.
>> I am a little stuck on whether this is a query, filter, custom sort, or some combination,
>> and how to get the best performance. For example, if I could write a filter that eliminates
>> all duplicate titles, giving preference to those owned by me, I could then just perform a
>> search on the remainder (assuming that filters are applied before searches). Then, a custom
>> sort based on whether or not I own the doc would be straightforward.
>> But I am not sure how to implement the filter. It is not a simple DuplicateFilter
>> because it operates on two fields. It is similar to the security filter example in section
>> 5.6.7 of Lucene in Action, except that I still want to be able to see documents that I don't
>> own, if I don't own a book with the same title. The custom filter in section 6.4 is also
>> close, but my problem is more complex because it depends on two fields.
>> While iterating over the documents, the filter would have to remember which titles
>> have been seen, and then keep the ones that I own. For example if it iterated over the values
>> above in order, it would see the title "To Have and To Have Not", not owned by me; and then
>> see the same title again, owned by me, and have to know that it should drop the first doc
>> and keep the second. I can't think of how to do this without using a lot of memory,
>> essentially keeping all titles in memory while iterating, which seems very expensive. It
>> isn't a simple "match" function because whether or not I match depends on the other
>> documents in the set.
>> Thanks much for any guidance or info.
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail: