incubator-lucy-dev mailing list archives

From Dan Markham <dmark...@gmail.com>
Subject [lucy-dev] Re: [lucy-user] ClusterSearcher
Date Mon, 07 Nov 2011 04:39:51 GMT
I'm so looking forward to this discussion. 

We have built a closed-source multi-master system with replication.  Currently we are not
using a PolySearcher to query more than one server.  Each box has a full copy of the index.
What we have done is multiplex queries within an index on a single box by using a different
process per segment (with help from Marvin). It's been pretty slick if your index isn't larger
than RAM and you have a few spare CPU cores to spread the work across.
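
To make the per-segment idea concrete, here is a rough sketch of the scatter/gather shape.
It is not our code and not a Lucy API: SEGMENTS, search_segment and search_all are
made-up names, the "segments" are toy in-memory lists, the score is a raw term count, and
threads stand in for the worker processes so the example stays self-contained.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for index segments: segment name -> list of (doc_id, text).
SEGMENTS = {
    "seg_1": [(1, "lucy cluster search"), (2, "perl bindings")],
    "seg_2": [(3, "cluster cluster replication"), (4, "real-time index")],
    "seg_3": [(5, "search cluster snapshot")],
}


def search_segment(seg_name, term, k):
    """Score the docs of one segment by raw term count; return that segment's top k."""
    hits = []
    for doc_id, text in SEGMENTS[seg_name]:
        score = text.split().count(term)
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)


def search_all(term, k, executor_cls=ThreadPoolExecutor):
    """Fan the query out, one task per segment, then merge the partial top-k lists."""
    n = len(SEGMENTS)
    with executor_cls(max_workers=n) as pool:
        partials = pool.map(search_segment, SEGMENTS, [term] * n, [k] * n)
        # Each worker already trimmed to k hits, so the merge sees at most n * k items.
        merged = heapq.nlargest(k, (hit for part in partials for hit in part))
    return [(doc_id, score) for score, doc_id in merged]


if __name__ == "__main__":
    print(search_all("cluster", 2))
```

Because every worker returns only its own top k, the merge step stays small no matter how
large each segment is. Passing a process-based executor class instead of the thread pool
would be the closer analogue of what we run.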
 
ZeroMQ and Google's Protocol Buffers both look great for building a distributed search
solution. 

Regardless of the path we take for building and shipping a clustered search solution,
I'm mostly interested in the APIs to the lower-level Lucy that make it possible, and in how
to make them better. I'm sure few will have my exact use cases, so flexibility in the core
Lucy is key for me.

Challenges we have seen with regard to distributed search:
1. Keeping the snapshot around long enough for a searcher to come back and ask for doc_ids.
    Our index moves quickly (real-time): many docs/segments a second.
    This is mainly an issue because we insist on reopening the index for every write
    in order to maintain a real-time feel.
2. Replication.
    One copy never seems to be enough (boxes crash, networking, high load, you name it), so
    replicating data to other boxes while keeping the perception of real time is always a
    challenge for us.
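
For challenge 1, one way to picture "keeping the snapshot around" is a lease: a searcher
pins the snapshot it answered from, and cleanup skips anything that is still pinned. The
sketch below is only an illustration under that assumption. SnapshotRegistry and its
methods are hypothetical names, not part of Lucy, and the clock is injectable so the
behaviour can be checked without sleeping.

```python
import time


class SnapshotRegistry:
    """Hypothetical registry that defers deleting a snapshot while it is leased."""

    def __init__(self, lease_seconds, clock=time.monotonic):
        self.lease_seconds = lease_seconds
        self.clock = clock
        self.current = None
        self.leases = {}  # snapshot name -> lease expiry time

    def publish(self, name):
        """Called after each commit: the new snapshot becomes the current one."""
        self.current = name
        self.leases.setdefault(name, 0.0)

    def acquire(self):
        """A searcher pins the current snapshot for one lease window."""
        if self.current is None:
            raise RuntimeError("no snapshot has been published yet")
        self.leases[self.current] = self.clock() + self.lease_seconds
        return self.current

    def purge(self):
        """Forget snapshots that are neither current nor under a live lease."""
        now = self.clock()
        dead = [name for name, expiry in self.leases.items()
                if name != self.current and expiry <= now]
        for name in dead:
            del self.leases[name]
        return dead


if __name__ == "__main__":
    fake_now = [0.0]
    registry = SnapshotRegistry(lease_seconds=5, clock=lambda: fake_now[0])
    registry.publish("snapshot_1")
    held = registry.acquire()        # searcher answers from snapshot_1
    registry.publish("snapshot_2")   # a write lands and the index is reopened
    print(held, registry.purge())    # snapshot_1 survives while the lease is live
    fake_now[0] = 6.0
    print(registry.purge())          # lease expired, snapshot_1 can go
```

The trade-off is the one described above: the faster the index is reopened, the more old
snapshots pile up inside the lease window, so the lease length bounds both how long a
searcher may take to come back for its doc_ids and how much stale data stays on disk.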

I'm sure once we flesh out the plan we'll have lots of fun things to chat about: deletes,
sort caches, the TF-IDF cache.


I'm all for any APIs that help with replication and with maintaining indexes in distributed setups.


-Dan


> For Lucy as a whole, I think there are some meta-questions that should
> be resolved before we go down this path.
> 
> 1) How core is this to Lucy's functionality?
> 2) How much should we depend on outside libraries?
> 3) How independent should the Searcher and the Clients be?
> 4) How future-proof and scalable do we want this solution to be?
> 
> My position would be that while search clusters are essential to Lucy,
> our core competency is fast search rather than reliable networking,
> and thus we should use well-tested external libraries rather than
> expanding our scope.  I think the remote Clients and the central
> Searcher should be essentially independent of each other and of this
> networking layer.   And I think that we should aim to make it scale to
> the moon.
> 
> Fleshing this out a little bit, I think we should prefer libev in C
> over IO::Select in Perl, and that we should prefer something
> high level like ZeroMQ over dealing with libev.  I think we should
> have a well defined query and response format using something like
> Google's Protocol Buffers rather than serializing objects directly.  I
> think a good goal would be allowing Lucene with a wrapper to act as a
> Client.
> 
> Marvin: could you offer a high-level overview of how cluster search
> would work ideally, with particular emphasis on what gets passed over
> the wire and what out-of-band coordination is needed between Searcher
> and Clients?
> 
> --nate

