lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Random selection of files
Date Mon, 04 Feb 2008 15:10:33 GMT
Well, assuming that by "same weight" you are referring to the document
scores (relevance), you certainly have to do the search first. But you can
use TopDocs to get a list of the document IDs arranged by decreasing
score, i.e. sorted by relevance.

But "same weight" is tricky. It's virtually certain that your first 1,000
documents will NOT have the same relevance score. Whether they differ by a
little or a lot is entirely dependent upon the query and your corpus. Note
that TopDocs scores are NOT normalized, although you can normalize them
because TopDocs will give you the max score of any doc in this search. So I
assume you'll have to make a decision about how many of the docs constitute
the set relevant enough to have in your set of choices.
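
To make the normalization idea concrete, here is a minimal sketch in plain
Java. It assumes you have already copied the doc IDs and raw scores out of
the TopDocs result into two parallel arrays, along with the max score. The
class name RelevantSet, the method candidates, and the minRatio cutoff are
all made up for illustration; they are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;

class RelevantSet {
    /**
     * Keeps the doc IDs whose normalized score (score / maxScore) is at
     * least minRatio. The arrays are assumed to be parallel, i.e.
     * scores[i] belongs to docIds[i].
     */
    static List<Integer> candidates(int[] docIds, float[] scores,
                                    float maxScore, float minRatio) {
        List<Integer> out = new ArrayList<>();
        if (maxScore <= 0f) {
            return out; // nothing sensible to normalize against
        }
        for (int i = 0; i < docIds.length; i++) {
            if (scores[i] / maxScore >= minRatio) {
                out.add(docIds[i]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] ids = {7, 3, 9, 1};
        float[] scores = {2.0f, 1.5f, 1.0f, 0.2f};
        // Keep every doc scoring at least half as well as the best hit.
        System.out.println(candidates(ids, scores, 2.0f, 0.5f)); // prints [7, 3, 9]
    }
}
```

Where to put the cutoff is a judgment call that depends on your queries and
corpus; 0.5 above is only a placeholder.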

So, it should be pretty straightforward to get a list of the top N document
IDs. If speed is an issue, think about lazy-loading the fields you need to
extract from each document.
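
Once you have the top N document IDs, picking 10 of them at random does not
require loading any documents first. The following is a rough sketch in
plain Java using a partial Fisher-Yates shuffle; it assumes the IDs are
already in an int array, and the names LuckyPick and sample are hypothetical,
not Lucene API.

```java
import java.util.Arrays;
import java.util.Random;

class LuckyPick {
    /**
     * Returns k distinct doc IDs chosen uniformly at random from docIds
     * (or all of them, in random order, if there are fewer than k).
     * The input array is not modified.
     */
    static int[] sample(int[] docIds, int k, Random rnd) {
        int[] pool = docIds.clone();
        int n = Math.min(k, pool.length);
        for (int i = 0; i < n; i++) {
            // Swap a random element from the untouched tail into slot i.
            int j = i + rnd.nextInt(pool.length - i);
            int tmp = pool[i];
            pool[i] = pool[j];
            pool[j] = tmp;
        }
        return Arrays.copyOf(pool, n);
    }

    public static void main(String[] args) {
        int[] topIds = new int[1000];
        for (int i = 0; i < topIds.length; i++) {
            topIds[i] = i; // stand-in for the IDs taken from the search result
        }
        System.out.println(Arrays.toString(sample(topIds, 10, new Random())));
    }
}
```

Only the 10 chosen IDs then need their documents (or the few fields you
care about) fetched from the index.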

But you certainly don't want to do this with a Hits object for any number
greater than about 100, since the Hits object will re-execute the query
every 100 docs or so.

Best
Erick


On Feb 4, 2008 6:37 AM, Juerg Meier <Juerg.Meier@ctp-consulting.com> wrote:

> Hi,
>
> We have the requirement for an "i'm feeling lucky" button, at least sort
> of. Whereas google just delivers the first record in a result set, we should
> deliver 10 arbitrary hits chosen out of, let's say, 1000. All of these
> documents have the same importance i.e. have the same weight.
>
> So, is there an elegant way with the Lucene API to achieve this? Or do we
> need to retrieve all 1000 docs first, to do a random selection on our own
> afterwards? That appears to be quite expensive.
>
> Thanks for any hint,
> -- Juerg
>
