lucene-java-user mailing list archives

From "Grant Ingersoll" <gsing...@syr.edu>
Subject Re: 'Sponsored' links
Date Sun, 15 Feb 2004 21:01:55 GMT
Does the sponsored information have to be in the index?  Couldn't you look up
the sponsor info in a database (or something else) after getting back your
initial results and then re-sort the hit list, moving up the sponsored elements
while maintaining the rest of the results as is?  If your list of sponsors is
truly that small, you could just put 'em in a file and load the list into memory.

Seems then you don't have to re-index when your sponsorships change, and you
really have no dependencies on Lucene, such as trying to get boost values
right, etc.

I guess this resembles #2.
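
A minimal sketch of that re-sort, assuming the hit list has already been
reduced to an ordered list of document ids and the sponsored ids are held in an
in-memory set (loaded from a file or database, say). The class and method names
are made up for illustration and nothing here is Lucene API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of Lucene.
final class SponsoredReorder {

    /**
     * Returns a new list with the sponsored ids first. Relative order is
     * preserved within both the sponsored and the unsponsored group, so the
     * rest of the results stay ranked as the search returned them.
     */
    static List<String> promoteSponsored(List<String> rankedIds, Set<String> sponsoredIds) {
        List<String> sponsored = new ArrayList<>();
        List<String> others = new ArrayList<>();
        for (String id : rankedIds) {
            if (sponsoredIds.contains(id)) {
                sponsored.add(id);
            } else {
                others.add(id);
            }
        }
        sponsored.addAll(others);
        return sponsored;
    }
}
```

Changing sponsors then only means changing the contents of the set; the index
is untouched.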


>>> dbd@smart.net 02/15/04 03:49PM >>>
I am a newbie to Lucene, and this is my first serious posting
to Lucene-user.

This is to solicit comment upon the problem of supplying
a "sponsored links" capability within Lucene. This capability
would not affect at all which documents are returned by a query,
but would cause any 'sponsored' documents present among the
results to be displayed before other documents in the list
returned.

I have looked over the correspondence in Lucene-user, but
not found anything addressing this topic; if I have missed it,
please tell me where and when, and ignore the rest of this.

It seems to me that there are three ways to achieve the
capability:

1. Preset boost values for 'sponsored' documents, with an
    implied burden of reindexing when sponsors are modified.

2. Post-qualify documents present in the hit list for their
    sponsorship status, building a new hit list.

3. Modify the query to search using both the full query as
    an unsponsored boolean clause with the default boost value,
    and for each sponsor, to repeat the full query ANDed with
    that sponsor with the appropriate boost value.

Are there other strategies not considered?

Assuming a small list of sponsors (10 or fewer), and low
volatility amongst the sponsors (1 change / month or less)
which method is best?

I have been pursuing method #1, almost to the exclusion of
the others, but have encountered an unknown difficulty in the
implementation (separate posting).  In particular, while it is clear
that #3 is doable, I know nothing about the searching burden
added by multiplying the user's query by one plus the count of
sponsors.

Regarding #3, if my understanding is right, then:
    Sponsor names:    s1, s2, s3 ...
    Words or phrases: s1w1, s1w2, ... , s2w1, s2w2, ... , s3w1 ....
    Boost values:     s1v, s2v, s3v

    then given query q as user input, form:
             q
             or (q and (s1w1 | s1w2 | s1w3 | ...)^s1v)
             or (q and (s2w1 | s2w2 ...)^s2v)
             or (q and (s3w1 ...)^s3v)
Is this correct?
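
The expansion above could be sketched as plain string building, producing text
in Lucene's query syntax (AND, OR, quoted phrases, and ^ for a boost). The class
and method names are hypothetical, special characters in the user query are not
escaped, and whether the resulting scores order sponsored documents as intended
is not verified here:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Lucene.
final class SponsoredQuerySketch {

    /** One sponsor: its words or phrases plus the boost to apply. */
    static final class Sponsor {
        final List<String> terms;
        final float boost;

        Sponsor(List<String> terms, float boost) {
            this.terms = terms;
            this.boost = boost;
        }
    }

    /** Builds: (q) OR ((q) AND ("w1" OR "w2")^boost) OR ... */
    static String expand(String userQuery, List<Sponsor> sponsors) {
        StringBuilder sb = new StringBuilder();
        sb.append('(').append(userQuery).append(')');
        for (Sponsor s : sponsors) {
            List<String> quoted = new ArrayList<>();
            for (String term : s.terms) {
                // Quote every term so multi-word phrases stay together.
                quoted.add("\"" + term + "\"");
            }
            sb.append(" OR ((").append(userQuery).append(") AND (")
              .append(String.join(" OR ", quoted))
              .append(")^").append(s.boost).append(')');
        }
        return sb.toString();
    }
}
```

Note that the user's query text appears once per sponsor plus once on its own,
which is the "one plus the count of sponsors" multiplication mentioned above.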

Does the strategy of search identify any kind of intermediate
sublist to speed up searching? (But then it would start to
resemble #2.)

Rolling one's own for #2 would run query q, and get the
HitCollector. Separately run queries for each of:
             s1w1 | s1w2 | s1w3 | ...,
             s2w1 | s2w2 ...
             s3w1 ...
then merge each hit collector with the one from query q.
(Just AND the bitsets???) Lastly adjust scores and form
a new composite HitCollector.  By this time I have told
everyone much more than I know.
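
The "AND the bitsets" step might look like the sketch below. It assumes each
query's matching document numbers have already been gathered into a
java.util.BitSet (for instance inside a HitCollector's collect callback); the
class and method names are hypothetical, and score adjustment is left out:

```java
import java.util.BitSet;

// Hypothetical helper for illustration; not part of Lucene.
final class SponsoredBitSets {

    /** Documents matching both the user's query and the sponsor's query. */
    static BitSet sponsoredHits(BitSet queryHits, BitSet sponsorHits) {
        BitSet result = (BitSet) queryHits.clone(); // leave the inputs untouched
        result.and(sponsorHits);
        return result;
    }

    /** Documents matching the user's query but not the sponsor's query. */
    static BitSet unsponsoredHits(BitSet queryHits, BitSet sponsorHits) {
        BitSet result = (BitSet) queryHits.clone();
        result.andNot(sponsorHits);
        return result;
    }
}
```

Because the inputs are cloned rather than modified, a sponsor's bitset could be
computed once and reused across user queries.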

Stray thought:-- can HitCollectors be cached at application init?

There are many other questions regarding details of implementation,
but their proper place is another communication.

Just preparing this document for dissemination has helped
greatly.  Any and all comments are much appreciated.

Thank you all.



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org 
For additional commands, e-mail: lucene-user-help@jakarta.apache.org 


