incubator-lucy-dev mailing list archives

From: Nathan Kurz <n...@verse.com>
Subject: Re: [lucy-dev] ClusterSearcher
Date: Thu, 10 Nov 2011 22:03:24 GMT
On Mon, Nov 7, 2011 at 1:50 PM, Marvin Humphrey <marvin@rectangular.com> wrote:
> On Sun, Nov 06, 2011 at 08:39:51PM -0800, Dan Markham wrote:
>> ZeroMQ and Google's Protocol Buffers both looking great for building a
>> distributed search solution.
>
> The idea of normalizing our current ad-hoc serialization mechanism using
> Google Protocol Buffers seems interesting, though it looks like it might be a
> lot of work and messy besides.
>
> First, Protocol Buffers doesn't support C -- only C++, Python and Java -- so
> we'd have to write our own custom plugin.  Dunno how hard that is.

While I'm relying on Google rather than experience, I don't think that
C support is actually a problem.
There seem to be C bindings: http://code.google.com/p/protobuf-c/
Or roll your own:
http://blog.reverberate.org/2008/07/12/100-lines-of-c-that-can-parse-any-protocol-buffer/
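
Judging from that post (and this is just a hedged sketch on my part, not
protobuf-c code), the wire format bottoms out in base-128 varints; field
keys and lengths are all encoded that way, which is why a tiny C parser is
feasible.  Something along these lines:

    #include <stdint.h>
    #include <stddef.h>

    /* Decode one base-128 varint from buf into *out.  Returns the number
     * of bytes consumed, or 0 on truncated/malformed input. */
    static size_t
    decode_varint(const uint8_t *buf, size_t len, uint64_t *out) {
        uint64_t value = 0;
        for (size_t i = 0; i < len && i < 10; i++) {
            value |= (uint64_t)(buf[i] & 0x7F) << (7 * i);
            if ((buf[i] & 0x80) == 0) {   /* high bit clear: last byte */
                *out = value;
                return i + 1;
            }
        }
        return 0;
    }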

> Second, the Protocol Buffers compiler is a heavy dependency -- too big to
> bundle.  We'd have to capture the generated source files in version control.

Alternatively, it could just be a dependency.  While I recognize your
desire to keep the core free of such, I think it's entirely reasonable
for LucyX packages to require outside libraries and tools.  The
question would be whether it's reasonable or desirable to relegate
ClusterSearch to non-core.

> Further investigation seems warranted.  It would sure be nice if we could lower
> our costs for developing and maintaining serialization routines.
>
On Mon, Nov 7, 2011 at 2:39 PM, Nick Wellnhofer <wellnhofer@aevum.de> wrote:
> MessagePack might be worth a look. See http://msgpack.org/

Yes, that looks good too.  I'm not suggesting that we restrict ourselves
to Protocol Buffers, only that it should be possible to use them for
interprocess communication, among other options.  A good architecture
(in my opinion) would be one that allows the over-the-wire protocol to
change without requiring in-depth knowledge of Lucy's internals.  I
think the key is to have a clear definition of what "information" is
required by each layer of Lucy, rather than serializing and
deserializing raw objects.
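
As a strawman (hypothetical names, not existing Lucy types), the
"information" a remote node hands back for a search might be no more than
this, so the wire encoding (protobuf, MessagePack, JSON, whatever) could
change without touching Doc or Searcher internals:

    #include <stdint.h>

    /* Hypothetical sketch, not actual Lucy structs. */
    typedef struct {
        uint64_t doc_id;   /* doc number local to the remote shard */
        float    score;
        uint32_t shard;    /* which node produced the hit          */
    } RemoteHit;

    typedef struct {
        uint64_t   total_hits;   /* total matches on that node     */
        uint32_t   num_hits;     /* how many hits follow           */
        RemoteHit *hits;
    } RemoteTopDocs;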

> As for ZeroMQ, it's LGPL which pretty much rules it out for us -- nothing
> released under the Apache License 2.0 can have a required LGPL dependency.

You know these rules better than I do, but I sometimes worry that your
interpretations are stricter than required by Apache's legal counsel.
There's room for optional dependencies:
http://www.apache.org/legal/resolved.html#optional
For example, it looks like Apache Thrift (another alternative protocol
to consider) isn't scared of ZeroMQ:
https://issues.apache.org/jira/browse/THRIFT-812

>> Regardless of the path we go down for building / shipping a clustered search
>> solution, I'm mostly interested in the APIs to the lower-level Lucy that
>> make it possible and how to make them better.
>
> Well, my main concern, naturally, is the potential burden of exposing low-level
> internals as public APIs, constraining future Lucy core development.

It's a good concern, and I'm not certain what Dan is envisioning, but
I'm hoping that improving the APIs means _less_ exposure of the
internals.  Rather than passing around Searcher and Index objects
everywhere, I'd love to make it explicitly clear what information is
available to whom:  if a remote client doesn't return it, you can't
use it.  Instead of increasing exposure for remote clients, we'd
simplify the interface to local Searchers.
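
Reusing the RemoteTopDocs sketch above, the client-facing surface could be
as narrow as a couple of calls (again, hypothetical names, not proposed
Lucy APIs), with the Searcher and Index objects living entirely on the far
side of the wire:

    typedef struct RemoteNode RemoteNode;   /* opaque connection handle */

    RemoteTopDocs* RemoteNode_Search(RemoteNode *node, const char *query_str,
                                     uint32_t offset, uint32_t num_wanted);
    void           RemoteTopDocs_Destroy(RemoteTopDocs *docs);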

> If we actually had a working networking layer, we'd have a better idea about
> what sort of APIs we'd need to expose in order to facilitate alternate
> implementations.  Rapid-prototyping a networking layer in Perl under LucyX with
> a very conservative API exposure and without hauling in giganto dependencies
> might help with that. :)

Yes!  I don't want to stand in the way of progress.  Prototyping
something that works is a great idea.   I don't have the fear of
dependencies that you do, but if you think it's faster to build
something simple from the ground up rather than using a complex
existing package, have at it!

--nate
