lucy-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-user] Hits offset and search performarce
Date Mon, 12 Nov 2012 03:19:57 GMT
On Sat, Nov 10, 2012 at 11:13 AM, Thomas den Braber <thomas@delos.nl> wrote:
> I am migrating from Swish-e to Lucy, most of the features I have implemented
> without many troubles but I have found some (big) differences in performance
> when using the offset parameter in the hits object.
>
> If I do:
>
> my $hits = $searcher->hits(
>         query => $queryobj,
>         sort_spec => $sort_spec,
>         num_wanted => 20,
>         offset => 5000
> );
>
> The search time is much higher then with the offset set to 0. In Swish-e
> there is only a small increase in search time.

I don't know how Swish-e implements sorting of hits, but this is expected
behavior in Lucy.

While iterating through matches, the top hits are kept in a priority queue.
The size of the priority queue is `offset + num_wanted`.  The bigger the
priority queue, the more MatchDoc objects which have to be created to serve as
elements within the priority queue, the more comparison operations which have
to take place to keep the priority queue sorted, etc -- impacting both speed
and memory footprint.

> I found out that when running with  num_wanted => 5000 and offset => 0 the
> search time is almost the same as with num_wanted => 20 and offset => 5000.
> It looks like there is no optimization when using 'offset'.

In one case, the priority queue has a size of 5000, and in the other it has a
size of 5020 -- not much difference!

(Note: we don't retrieve any of the actual document content until later --
individual Doc objects are fetched on demand with each call to `$hits->next`.)

> I would expect that using the offset, performance should be higher because
> no processing needs to be done to the hits before the offset (no score
> calculation).

How do you know that the hit number 5000 actually ranks 5000th in sort order
unless you calculate scores for all documents and perform sorting?

There are certain times when Lucy can avoid calculating scores -- when
SortSpecs do not require scores, or when documents match pure negative clauses
(docs matching "bar" in the query `foo AND NOT bar`).  But when you are
ranking documents based on score, we have to calculate a score for **every**
document.

> Is there a way to improve the search performance when using the 'offset' ?

Well, you could say that we already do. :)  The shallower your search, the
smaller we can make the priority queue!

> Also I am missing the seek command (that exists in swish-e) to fast forward
> through my results or did I overlook something ?

I would assume that Swish-e and Lucy are implemented differently.  I don't
know what seek() does in the context of Swish-e.

> But in general I am very satisfied with Lucy so far. Thanks for the work guys !

:)

Marvin Humphrey

Mime
View raw message