lucene-java-user mailing list archives

From balasubramanian sudaakeran <>
Subject Re: Matching on "owned" docs -- filter or query? Or sort?
Date Mon, 23 Jul 2012 01:17:41 GMT
On the boosting approach, you can make the title match mandatory and add an optional match
on userId with a very high boost. The results would still contain duplicates, but you don't
need to sort to remove them. Just add the results to a list in the order they arrive, and skip
any result whose title is already in the list. The pitfall with this approach is that if you
later extend the title match to do more (tolerating spelling mistakes, for example), you need
to worry about which one to show first.
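
The "keep the first one you see" step could look roughly like the plain-Java sketch below. It
does not use the Lucene API at all: `Hit` and `dedupeByTitle` are hypothetical names made up
for this illustration, and the input list stands in for hits that have already come back ranked
with the owner boost applied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class FirstTitleWins {
    /** Hypothetical stand-in for one search hit; not a Lucene class. */
    static final class Hit {
        final int ownerId;
        final String title;

        Hit(int ownerId, String title) {
            this.ownerId = ownerId;
            this.title = title;
        }
    }

    /** Keeps the first hit seen for each title, preserving the incoming order. */
    static List<Hit> dedupeByTitle(List<Hit> rankedHits) {
        Set<String> seenTitles = new HashSet<>();
        List<Hit> kept = new ArrayList<>();
        for (Hit hit : rankedHits) {
            // Set.add returns false when the title was already present.
            if (seenTitles.add(hit.title)) {
                kept.add(hit);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed ranking for user 14 searching "have": the boost put the owned copy first.
        List<Hit> ranked = Arrays.asList(
            new Hit(14, "To Have and To Have Not"),
            new Hit(13, "To Have and To Have Not"),
            new Hit(19, "To Have and To Have Not"),
            new Hit(18, "Have a Little Faith"));
        for (Hit hit : dedupeByTitle(ranked)) {
            System.out.println(hit.ownerId + "\t" + hit.title);
        }
    }
}
```

This only gives the right answer if the owned copy of a title always outranks the other copies,
which is exactly what the high boost is meant to guarantee.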

 From: Uncle <>
Sent: Sunday, July 22, 2012 11:03 PM
Subject: Re: Matching on "owned" docs -- filter or query? Or sort?
Thanks for the reply.  I thought of using boosting, for example "((userId:14 AND title:have)^10
OR (title:have))" or "((userId:14^10 AND title:have) OR (title:have))" or something like that. 
However, there would still be duplicates (all 3 docs for "To Have and To Have Not" would be
included whereas I would only want the one I own to be there).  This also requires using
the scoring for sorting so I can't apply other sorting (I would want to sort the results secondarily
by title for example). I might be able to go this route, but it seems like some combination
of custom filtering and sorting would work better.

I thought of somehow doing an empty query to fetch all docs, sorting them to put docs with
the userId first, and then running a DuplicateFilter on title with KM_USE_FIRST_OCCURRENCE. 
This is the duplicate elimination behavior I want.  Then do a text search on the remainder. 
But this seems very expensive.
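
One possibly cheaper variant, sketched in plain Java: run the text search first and apply the
owner preference only to its hits, which is the reverse of the order described above. Nothing
here is Lucene API. `Book` and `ownedFirst` are hypothetical names for this illustration, and
the sketch assumes the matching docs fit in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class OwnedFirstResults {
    /** Hypothetical stand-in for one matching document; not a Lucene class. */
    static final class Book {
        final int ownerId;
        final String title;

        Book(int ownerId, String title) {
            this.ownerId = ownerId;
            this.title = title;
        }
    }

    /**
     * Keeps one doc per title (the user's own copy if there is one), then sorts
     * owned docs first and breaks ties by title.
     */
    static List<Book> ownedFirst(List<Book> matches, int userId) {
        Map<String, Book> bestPerTitle = new HashMap<>();
        for (Book book : matches) {
            Book current = bestPerTitle.get(book.title);
            boolean upgradesToOwned =
                current != null && book.ownerId == userId && current.ownerId != userId;
            if (current == null || upgradesToOwned) {
                bestPerTitle.put(book.title, book);
            }
        }
        List<Book> result = new ArrayList<>(bestPerTitle.values());
        // Boolean.FALSE sorts before Boolean.TRUE, so "not owned == false" comes first.
        result.sort(Comparator.comparing((Book b) -> b.ownerId != userId)
            .thenComparing(b -> b.title));
        return result;
    }

    public static void main(String[] args) {
        // Docs from the example data whose titles match "have".
        List<Book> matches = Arrays.asList(
            new Book(13, "To Have and To Have Not"),
            new Book(14, "To Have and To Have Not"),
            new Book(19, "To Have and To Have Not"),
            new Book(18, "Have a Little Faith"));
        for (Book book : ownedFirst(matches, 14)) {
            System.out.println(book.ownerId + "\t" + book.title);
        }
    }
}
```

The map holds one entry per distinct matching title rather than every title in the index, and
the secondary sort by title no longer depends on the score.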


On Jul 22, 2012, at 11:33 AM, Erick Erickson wrote:

> Hmmm, what about simply boosting very high on owner, and probably
> grouping on title?
> If you boosted on owner, you wouldn't even have to index the title
> separately for each user, your "owner" field could be multivalued and
> contain _all_ the owner IDs. In that case you wouldn't have to group
> at all......
> Best
> Erick
> On Sun, Jul 22, 2012 at 11:06 AM, Uncle <> wrote:
>> I also posted this to StackOverflow, apologies if you see this twice.
>> I have a data set whereby documents are associated with a user id. Say that the documents
represent books, and each book can have one or more owners. I am indexing the titles with Lucene.
When searching, I want all results owned by me to be sorted at the top of the results before
results that are not owned by me. So the data might look like:
>> Owner ID    Book Title
>> --------    ----------
>> 13          To Have and To Have Not
>> 14          To Have and To Have Not
>> 19          To Have and To Have Not
>> 18          Have a Little Faith
>> 15          Snow Crash
>> 17          Snow Crash
>> 18          Cryptonomicon
>> 14          Of Mice And Men
>> 17          Flash Crash
>> Say that my user id is 14 and I search on "have", I want to match on both "To Have
and To Have Not" and "Have a Little Faith", but "To Have and To Have Not" should show up higher
in my search results, because I own it.  Similarly, if I am user id 15 and search for "Crash",
I will match both "Snow Crash" and "Flash Crash", but "Snow Crash" should show up first
because I own it.  If I am user id 14 and I search for "crash", I would still get a match
for "Snow Crash" even though I don't own it.  If I did a fuzzy match for "a" which would
match almost all of these titles, I would see those that I own before I see the others.
>> I am a little stuck on whether this is a query, filter, custom sort, or some combination,
and how to get the best performance.  For example, if I could write a filter that eliminates
all duplicate titles, giving preference to those owned by me, I could then just perform a
search on the remainder (assuming that filters are applied before searches). Then, a custom
sort based on whether or not I own the doc would be straightforward.
>> But I am not sure how to implement the filter. It is not a simple DuplicateFilter
because it operates on two fields. It is similar to the security filter example in section
5.6.7 of Lucene in Action, except that I still want to be able to see documents that I don't
own, if I don't own a book with the same title. The custom filter in section 6.4 is also close,
but my problem is more complex because it depends on two fields.
>> While iterating over the documents, the filter would have to remember which titles
have been seen, and then keep the ones that I own. For example if it iterated over the values
above in order, it would see the title "To Have and To Have Not", not owned by me; and then
see the same title again, owned by me, and have to know that it should drop the first doc
and keep the second. I can't think of how to do this without using a lot of memory, essentially
keeping all titles in memory while iterating, which seems very expensive. It isn't a simple
"match" function because whether or not I match depends on the other documents in the set.
>> Thanks much for any guidance or info.
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail: