lucy-user mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-user] SearchServer / ClusterSearcher - massive performance hit
Date Thu, 25 Oct 2012 01:44:28 GMT
Hi, Dag,

On Wed, Oct 24, 2012 at 5:08 AM, Dag Lem <dag@nimrod.no> wrote:

>> Some observations:

Thanks for the research and the thoughtful suggestions.

>> * Lucy::Search::IndexSearcher::top_docs (used by SearchServer) is
>>   about twice as slow as Lucy::Search::Searcher::hits (used by
>>   IndexSearcher).

Well, this may come as a surprise in light of your benchmarks, but
Searcher#hits() calls top_docs() internally. :)

    http://s.apache.org/vH  (link to git-wip-us.apache.org)

For the record, Searcher is IndexSearcher's parent class; IndexSearcher
inherits hits() but provides its own implementation of top_docs().  A fair
benchmark would involve comparing the results of top_docs() and hits() on a
single IndexSearcher -- and it would be very surprising if hits() was faster.
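
To make the inheritance point concrete, here is a rough toy model in Python.
It is not Lucy's API: the class names only mirror the relationship described
above, and the posting data, the `top_docs_calls` counter, and the `time_call`
helper are invented for illustration.  The point is only that every hits()
call pays for one top_docs() call plus some wrapping work.

```python
import time


class Searcher:
    """Toy parent class (hypothetical, not Lucy's real Searcher)."""

    def top_docs(self, query, num_wanted):
        raise NotImplementedError

    def hits(self, query, num_wanted=10):
        # hits() delegates to top_docs(), then wraps the raw results.
        top = self.top_docs(query, num_wanted)
        return [{"doc_id": doc_id, "score": score} for doc_id, score in top]


class IndexSearcher(Searcher):
    """Toy subclass: inherits hits(), supplies its own top_docs()."""

    def __init__(self, postings):
        self.postings = postings      # term -> list of (doc_id, score)
        self.top_docs_calls = 0       # counter added for illustration

    def top_docs(self, query, num_wanted):
        self.top_docs_calls += 1
        matches = self.postings.get(query, [])
        return sorted(matches, key=lambda m: m[1], reverse=True)[:num_wanted]


def time_call(fn, repeats=1000):
    """Return the wall-clock seconds taken by `repeats` calls of fn."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return time.perf_counter() - start


# Benchmark both methods on the *same* searcher object.
searcher = IndexSearcher({"foo": [(1, 0.5), (2, 0.9), (3, 0.1)]})
t_top = time_call(lambda: searcher.top_docs("foo", 2))
t_hits = time_call(lambda: searcher.hits("foo", 2))
```

In this model t_hits includes everything t_top measures, so any benchmark in
which hits() wins is measuring something other than the two methods
themselves.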

I suspect that at least some of the discrepancies you are seeing arise because:

*   IndexSearcher is a mature class implemented primarily in C.
*   ClusterSearcher is a comparatively young class implemented in Perl.

Should we port ClusterSearcher to C, I expect that we'll see some of the
performance anomalies smooth out.  However, I don't think we should focus on
that yet, because ClusterSearcher's architecture is not yet optimal -- and it
will be easier to refactor if we keep it in Perl for now.

> 1. Get rid of as many network roundtrips as possible.

+1

We'll need to address the sources of network traffic case-by-case, though.

> 2. Design a (simple) custom application protocol, to get rid of the
>    overhead of Storable.

Lucy has a custom hook written for Storable that wraps an internal
serialization mechanism.  If we're not going to go through the Storable
wrapper, we'll need to use the underlying mechanism directly, because Lucy
objects are mostly implemented in C and their internals are not directly
accessible from Perl-space.

This is doable from either Perl or C, but I concur that it should be a
lower priority than dealing with the network round-trips.

> As far as I can tell, the current protocol covers the following
> actions:
>
>   handshake
>   terminate
>   doc_max
>   doc_freq
>   top_docs
>   fetch_doc
>   fetch_doc_vec
>
> Here, doc_freq and top_docs should be replaced with something like
> docs_freq_and_top_docs, i.e. only one request / response per query.

I'll address this point in a separate email.
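
For reference, the round-trip arithmetic behind the proposal can be sketched
with a toy model in Python.  Nothing here is Lucy code: `ToyShard`, its
term-count scoring, and the two search helpers are hypothetical, and only the
action names come from the list quoted above.  The sketch also ignores any
ordering dependency between the two calls, which a real design would have to
consider.

```python
class ToyShard:
    """Stand-in for one remote search node; counts the requests it gets."""

    def __init__(self, docs):
        self.docs = docs      # doc_id -> list of terms
        self.requests = 0     # one increment per simulated round trip

    def _doc_freq(self, term):
        return sum(1 for terms in self.docs.values() if term in terms)

    def _top_docs(self, term, num_wanted):
        # Score is just the raw term count (a placeholder, not real scoring).
        scored = [(doc_id, terms.count(term))
                  for doc_id, terms in self.docs.items() if term in terms]
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        return scored[:num_wanted]

    def handle(self, action, term, num_wanted=10):
        self.requests += 1
        if action == "doc_freq":
            return self._doc_freq(term)
        if action == "top_docs":
            return self._top_docs(term, num_wanted)
        if action == "docs_freq_and_top_docs":
            return {"doc_freq": self._doc_freq(term),
                    "top_docs": self._top_docs(term, num_wanted)}
        raise ValueError("unknown action: " + action)


def search_separate(shards, term, num_wanted=10):
    """Two requests per shard: doc_freq, then top_docs."""
    total_freq = sum(s.handle("doc_freq", term) for s in shards)
    hits = [h for s in shards for h in s.handle("top_docs", term, num_wanted)]
    return total_freq, hits


def search_combined(shards, term, num_wanted=10):
    """One request per shard carrying both answers."""
    replies = [s.handle("docs_freq_and_top_docs", term, num_wanted)
               for s in shards]
    total_freq = sum(r["doc_freq"] for r in replies)
    hits = [h for r in replies for h in r["top_docs"]]
    return total_freq, hits
```

With N shards the separate form costs 2N requests per query and the combined
form costs N.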

> Furthermore fetch_doc and fetch_doc_vec should be replaced with
> something like fetch_docs and fetch_docs_vec, facilitating the
> fetching of several documents with a single request / response.

This one too.
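
As a rough illustration of what batching buys, here is a toy model in Python.
`ToyDocServer` and its request counter are hypothetical; only the names
fetch_doc and fetch_docs come from the proposal quoted above, and the real
protocol may end up looking quite different.

```python
class ToyDocServer:
    """Stand-in for a remote node that serves stored documents."""

    def __init__(self, docs):
        self.docs = docs      # doc_id -> document content
        self.requests = 0     # one increment per simulated round trip

    def fetch_doc(self, doc_id):
        """One request / response per document."""
        self.requests += 1
        return self.docs[doc_id]

    def fetch_docs(self, doc_ids):
        """One request / response for a whole list of documents."""
        self.requests += 1
        return [self.docs[doc_id] for doc_id in doc_ids]


server = ToyDocServer({1: "alpha", 2: "beta", 3: "gamma"})
wanted = [3, 1, 2]

one_by_one = [server.fetch_doc(doc_id) for doc_id in wanted]
requests_one_by_one = server.requests        # one per document

server.requests = 0
batched = server.fetch_docs(wanted)
requests_batched = server.requests           # one for the whole page of hits
```

Fetching a page of k hits drops from k round trips to one per node, which is
where the saving would come from.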

Marvin Humphrey
