incubator-lucy-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-dev] ClusterSearcher
Date Mon, 07 Nov 2011 21:50:22 GMT
On Sun, Nov 06, 2011 at 08:39:51PM -0800, Dan Markham wrote:
> ZeroMQ and Google's Protocol Buffers both looking great for building a
> distributed search solution.

The idea of normalizing our current ad-hoc serialization mechanism using
Google Protocol Buffers seems interesting, though it looks like it might be a
lot of work and messy besides.

First, Protocol Buffers doesn't support C -- only C++, Python and Java -- so
we'd have to write our own custom plugin.  Dunno how hard that is.

Second, the Protocol Buffers compiler is a heavy dependency -- too big to
bundle.  We'd have to capture the generated source files in version control.
That's theoretically doable -- it's how we're handling the Flex file which is
part of the Clownfish compiler -- but that one Flex file isn't likely to
change much from here on out, whereas developing serialization routines is an
ongoing task.

Further investigation seems warranted.  It would sure be nice if we could lower
our costs for developing and maintaining serialization routines.

As for ZeroMQ, it's LGPL which pretty much rules it out for us -- nothing
released under the Apache License 2.0 can have a required LGPL dependency.

In contrast, the libev license looks compatible:

    http://cvs.schmorp.de/libev/LICENSE?view=markup

Any networking layer that is going to require a dependency like libev should
be released separately from Lucy, though.

> Regardless of the path we go for building / shipping clustered search
> solution.  I'm mostly interested in the api's to the lower level lucy that
> make it possible and how to make them better.

Well, my main concern, naturally, is the potential burden of exposing low-level
internals as public APIs, constraining future Lucy core development.

If we actually had a working networking layer, we'd have a better idea about
what sort of APIs we'd need to expose in order to facilitate alternate
implementations.  Rapid-prototyping a networking layer in Perl under LucyX with
a very conservative API exposure and without hauling in giganto dependencies
might help with that. :)

> I'm sure few will have my exact use-cases so flexibility in the core Lucy is
> key for me.

I'm not convinced that we will be unable to meet those needs. :)

> 1. Keeping the snapshot around long enough for a searcher to come back and
>        ask for doc_ids.
>        Our index moves quickly (real-time) many docs/segments a second.
>        This is mainly an issue because we insist on reopening the index
>        for every write for us to maintain a real-time feel.

This can be achieved if we mod Lucy to enable deletion policies that leave
obsolete snapshots around for some amount of time.
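
To make that concrete, here is a rough sketch of the decision such a policy
might make, written in Python purely for illustration.  Nothing here is Lucy
API: the names `Snapshot`, `obsoleted_at`, `snapshots_to_delete` and the
`grace_seconds` parameter are all hypothetical, and a real implementation would
also have to work out which segment files the retained snapshots still
reference.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Snapshot:
    # Hypothetical record; not a Lucy class.
    name: str                      # e.g. "snapshot_3.json"
    obsoleted_at: Optional[float]  # epoch seconds; None while still current


def snapshots_to_delete(snapshots: List[Snapshot], now: float,
                        grace_seconds: float) -> List[str]:
    """Return names of snapshots obsolete for longer than the grace period."""
    doomed = []
    for snap in snapshots:
        if snap.obsoleted_at is None:
            continue  # the current snapshot is never a deletion candidate
        if now - snap.obsoleted_at > grace_seconds:
            doomed.append(snap.name)
    return doomed
```

With a grace period of, say, 30 seconds, a searcher that comes back to ask for
doc_ids within that window would still find the snapshot it originally
searched against.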

> 2. Replication.
>     One copy never seems to be enough (boxes crash, networking, high load, you
>     name it) so replication of data to other boxes and keeping the
>     perception of real time is always a challenge for us.  
 
> I'm sure once we flesh out the plan... we'll have lots of fun things to chat
> about deletes/sort-caches/TF IDF cache.

No doubt. :)
 
> I'm all for any api's that help in replication, maintaining indexes in
> distributed setups.

We'll certainly need this eventually, but I think that we can get distributed
search functionality working first and then follow up with the indexing layer
later.

Marvin Humphrey

