incubator-lucy-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject [lucy-dev] ClusterSearcher
Date Sat, 05 Nov 2011 18:23:03 GMT
cc to lucy-dev...

On Sat, Nov 05, 2011 at 08:28:46AM +0200, goran kent wrote:
> On 11/4/11, Marvin Humphrey <marvin@rectangular.com> wrote:
> > Sounds like the nodes are being accessed serially rather than in parallel.
> > I'll look into it.
> 
> I'd love to sniff around with some debug prints, etc, can you point me
> to the relevant code where this might (not) be occurring?

The serialized requests are initiated by PolySearcher -- see
PolySearcher_top_docs() in the source file
trunk/core/Lucy/Search/PolySearcher.c.  Here is the problematic loop, with
explanatory comments inserted.

    // Loop over an array of Searcher objects.  In this case, each of the
    // Searchers is a LucyX::Remote::SearchClient.
    for (i = 0, max = VA_Get_Size(searchers); i < max; i++) {
        // Extract an individual Searcher and its corresponding doc id offset.
        Searcher   *searcher   = (Searcher*)VA_Fetch(searchers, i);
        int32_t     base       = I32Arr_Get(starts, i);
        // This line triggers a call to the top_docs() subroutine within
        // SearchClient.pm.  It blocks until top_docs() returns, and thus the
        // total time to process all remote requests in this loop is the sum
        // of all child node response times.
        TopDocs    *top_docs   = Searcher_Top_Docs(searcher, (Query*)compiler,
                                                   num_wanted, sort_spec);
        /* ... */
    }

To process the searches in parallel, we need a select loop[1].  However,
PolySearcher can only access SearchClient via the abstract
Lucy::Search::Searcher interface -- it knows nothing about the socket calls
that are being made by SearchClient.pm.  PolySearcher would have to pierce
encapsulation in order to get at those sockets and multiplex the requests.

The most straightforward solution is to eliminate PolySearcher from the
equation and to create a class that combines the functionality of PolySearcher
and SearchClient.  Fortunately, neither of them is particularly large or
complex, so the task is very doable.

I propose that we name this new class LucyX::Remote::ClusterSearcher.  

  * Fork SearchClient.pm to ClusterSearcher.pm and t/510-remote.t to
    t/550-cluster_searcher.t.
  * Give ClusterSearcher the ability to talk to multiple SearchServers.
  * Change to a two-stage RPC mechanism:
    1. Fire off the requests to the individual SearchServers in a "for" loop.
    2. Gather the responses into an array using a select() loop (powered by an 
       IO::Select object).
  * Adapt each of the Searcher methods that ClusterSearcher implements to
    assemble a sensible return value from the array of responses using
    PolySearcher's techniques.
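
For illustration only, here is a minimal sketch of the two-stage mechanism,
written in C with select() called directly where ClusterSearcher.pm would use
an IO::Select object.  The names send_requests(), gather_replies() and
MAX_REPLY are hypothetical and not part of Lucy.  The sketch assumes that each
node's reply arrives in a single read(), and it ignores timeouts and EINTR.

```c
#include <stddef.h>
#include <string.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_REPLY 256   /* Hypothetical cap on reply size. */

/* Stage 1: send the same request to every node without waiting for any
 * reply.  Returns 0 on success, -1 if any write fails or is short. */
static int
send_requests(const int *fds, size_t num_nodes, const char *request) {
    size_t len = strlen(request);
    for (size_t i = 0; i < num_nodes; i++) {
        if (write(fds[i], request, len) != (ssize_t)len) { return -1; }
    }
    return 0;
}

/* Stage 2: collect one reply per node, in whatever order the replies
 * arrive.  replies[i] receives the NUL-terminated reply read from fds[i].
 * Returns the number of replies gathered, or -1 on error. */
static int
gather_replies(const int *fds, size_t num_nodes, char replies[][MAX_REPLY]) {
    int    done[FD_SETSIZE] = {0};
    size_t pending = num_nodes;

    if (num_nodes > FD_SETSIZE) { return -1; }

    while (pending > 0) {
        fd_set readable;
        int    max_fd = -1;

        /* Watch only the nodes that have not answered yet. */
        FD_ZERO(&readable);
        for (size_t i = 0; i < num_nodes; i++) {
            if (done[i]) { continue; }
            FD_SET(fds[i], &readable);
            if (fds[i] > max_fd) { max_fd = fds[i]; }
        }

        /* Block until at least one node has data ready. */
        if (select(max_fd + 1, &readable, NULL, NULL, NULL) < 0) {
            return -1;
        }

        for (size_t i = 0; i < num_nodes; i++) {
            if (done[i] || !FD_ISSET(fds[i], &readable)) { continue; }
            ssize_t got = read(fds[i], replies[i], MAX_REPLY - 1);
            if (got < 0) { return -1; }
            replies[i][got] = '\0';
            done[i] = 1;
            pending--;
        }
    }
    return (int)num_nodes;
}
```

Because every request is already in flight before the first read, the time
spent waiting is governed by the slowest node rather than by the sum of all
node response times.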

This won't be the end of our iterating if we want to build a robust clustering
system, because it doesn't yet address either node availability issues or
near-real-time updates.  However, it provides the functionality that we meant
to make available via PolySearcher/SearchServer/SearchClient, allowing Goran
to evaluate whether the system meets his basic requirements, and moves us
incrementally towards a highly desirable goal: a ClusterSearcher object backed
by multiple search nodes that is just as easy to use as an IndexSearcher
backed by one index on one machine.

PS: Goran...I'm under the weather right now, so if you're counting on me to
code this up, I'm not sure how quickly I'll get to it.

Marvin Humphrey

[1] http://www.perlfect.com/articles/select.shtml

