lucene-java-user mailing list archives

From "Daniel B. Davis" <>
Subject Re: 'Sponsored' links
Date Mon, 16 Feb 2004 22:06:06 GMT
The index contains documents, an unknown number of which are 
sponsored.  The number of sponsors is small, though the number of sponsored 
documents is not necessarily small.  In all of #1, #2, and #3, the 
sponsorship information must be accessed to determine sponsorship, and that 
information is indeed outside of the primary document index.  Lucene is used 
as a convenient way to store each sponsor's name, boost value, and the words 
or phrases which determine sponsorship, but a database could easily be used 
instead.  I chose Lucene both because I wanted to get deeper into using it, 
and because storing the data with Lucene keeps it closer to its eventual use.

At 04:01 PM 2/15/2004 -0500, you wrote:

>Does the sponsored information have to be in the index?  Couldn't you 
>look up the sponsor info in a database (or something else) after getting 
>back your initial results and then re-sort the hit list, moving up the 
>sponsored elements while maintaining the rest of the results as is?  If 
>your list of sponsors is truly that small, you could just put 'em in a 
>file and load the list into memory.
>It seems you then don't have to re-index when your sponsorships change, 
>and you really have no dependencies on Lucene with
>trying to get boost values right, etc.
>I guess this resembles #2.
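
A minimal sketch of the re-sort described above, in plain Java with no Lucene calls. It assumes the hit list has already been reduced to an ordered list of document ids and that the sponsored ids were loaded into memory from a file or database; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class SponsoredResort {
    /**
     * Returns a new list with sponsored ids first. The relative order
     * inside each group is kept, so the original ranking is preserved
     * among sponsored hits and among the remaining hits.
     */
    static List<String> promoteSponsored(List<String> rankedIds, Set<String> sponsoredIds) {
        List<String> sponsored = new ArrayList<>();
        List<String> others = new ArrayList<>();
        for (String id : rankedIds) {
            if (sponsoredIds.contains(id)) {
                sponsored.add(id);
            } else {
                others.add(id);
            }
        }
        sponsored.addAll(others);
        return sponsored;
    }
}
```

Because the sponsor set lives outside the index, a change of sponsors only means reloading the set, not re-indexing.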
> >>> 02/15/04 03:49PM >>>
>I am a newbie to Lucene, and this is my first serious posting
>to Lucene-user.
>This is to solicit comment upon the problem of supplying
>a "sponsored links" capability within Lucene. This capability
>would not affect at all which documents are returned by a query,
>but would cause any 'sponsored' documents present among the
>results to be displayed before other documents in the list.
>I have looked over the correspondence in Lucene-user, but
>not found anything addressing this topic; if I have missed it,
>please tell me where and when, and ignore the rest of this.
>It seems to me that there are three ways to achieve this:
>1. Preset boost values for 'sponsored' documents, with an
>     implied burden of reindexing when sponsors are modified.
>2. Post-qualify documents present in the hit list for their
>     sponsorship status, building a new hit list.
>3. Modify the query to search using both the full query as
>     an unsponsored boolean clause with the default boost value,
>     and for each sponsor, to repeat the full query ANDed with
>     that sponsor with the appropriate boost value.
>Are there other strategies not considered?
>Assuming a small list of sponsors (10 or fewer), and low
>volatility amongst the sponsors (1 change / month or less)
>which method is best?
>I have been pursuing method #1, almost to the exclusion of
>the others, but have encountered an unknown difficulty in the
>implementation (separate posting).  In particular, while it is clear
>that #3 is doable, I know nothing about the searching burden
>added by multiplying the user's query by one plus the count of sponsors.
>Regarding #3, if my understanding is right, then:
>     sponsor names:    s1, s2, s3 ...
>     words or phrases: s1w1, s1w2, ... , s2w1, s2w2, ... , s3w1 ....
>     boost values:     s1v, s2v, s3v
>     then given query q as user input, form:
>              q
>              or (q and (s1w1 | s1w2 | s1w3 | ...)^s1v)
>              or (q and (s2w1 | s2w2 ...)^s2v)
>              or (q and (s3w1 ...)^s3v)
>Is this correct?
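
For what it is worth, here is a rough sketch of how the expanded query string for #3 could be assembled. It only builds text in the style of Lucene's classic query syntax (OR, AND, quoted phrases, ^boost on a group); it does not call the query parser, and whether the boost behaves as hoped on an ANDed group would still need to be tested. The Sponsor class and all names are hypothetical.

```java
import java.util.List;

public class SponsorQueryBuilder {
    /** Hypothetical holder for one sponsor's words/phrases and boost value. */
    static final class Sponsor {
        final List<String> terms;
        final float boost;

        Sponsor(List<String> terms, float boost) {
            this.terms = terms;
            this.boost = boost;
        }
    }

    /** Phrases (anything containing a space) are wrapped in double quotes. */
    static String quoteIfPhrase(String term) {
        return term.contains(" ") ? "\"" + term + "\"" : term;
    }

    /** Forms: (q) OR ((q) AND (s1w1 OR s1w2 ...)^s1v) OR ... */
    static String expand(String userQuery, List<Sponsor> sponsors) {
        StringBuilder sb = new StringBuilder("(").append(userQuery).append(")");
        for (Sponsor s : sponsors) {
            StringBuilder group = new StringBuilder();
            for (String t : s.terms) {
                if (group.length() > 0) {
                    group.append(" OR ");
                }
                group.append(quoteIfPhrase(t));
            }
            sb.append(" OR ((").append(userQuery).append(") AND (")
              .append(group).append(")^").append(s.boost).append(")");
        }
        return sb.toString();
    }
}
```

With ten sponsors the user's query text appears eleven times in the result, which is the "one plus the count of sponsors" multiplication mentioned above.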
>Does the strategy of search identify any kind of intermediate
>sublist to speed up searching? (But then it would start to
>resemble #2.)
>Rolling one's own for #2 would run query q and get the
>HitCollector, then separately run queries for each of:
>              s1w1 | s1w2 | s1w3 | ...,
>              s2w1 | s2w2 ...
>              s3w1 ...
>and merge each hit collector with the one from query q.
>(Just AND the bitsets???) Lastly adjust scores and form
>a new composite HitCollector.  By this time I have told
>everyone much more than I know.
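
On "just AND the bitsets": a small sketch of that merge step using java.util.BitSet, assuming each query's matches have already been collected as one bit per document number. How the bits get filled from a Lucene search is left out here, and the names are illustrative only.

```java
import java.util.BitSet;

public class SponsoredHitMerge {
    /**
     * Returns the documents that match the user's query AND belong to
     * a sponsor. Neither input bit set is modified.
     */
    static BitSet sponsoredHits(BitSet queryHits, BitSet sponsorDocs) {
        BitSet result = (BitSet) queryHits.clone();
        result.and(sponsorDocs);
        return result;
    }
}
```

Since the sponsor bit sets do not depend on the user's query, they could in principle be computed once at application init and reused, as long as the index and the sponsor list do not change in between.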
>Stray thought:-- can HitCollectors be cached at application init?
>There are many other questions regarding details of implementation,
>but their proper place is another communication.
>Just preparing this document for dissemination has helped
>greatly.  All and any comments are much appreciated.
>Thank you all.
>To unsubscribe, e-mail:
>For additional commands, e-mail:
