lucene-general mailing list archives

From "Yonik Seeley" <yo...@apache.org>
Subject Re: [PROPOSAL] index server project
Date Tue, 31 Oct 2006 21:03:21 GMT
On 10/30/06, Doug Cutting <cutting@apache.org> wrote:
> Yonik Seeley wrote:
> > On 10/18/06, Doug Cutting <cutting@apache.org> wrote:
> >> We assume that, within an index, a file with a given name is written
> >> only once.
> >
> > Is this necessary, and will we need the lockless patch (that avoids
> > renaming or rewriting *any* files), or is Lucene's current index
> > behavior sufficient?
>
> It's not strictly required, but it would make index synchronization a
> lot simpler. Yes, I was assuming the lockless patch would be committed
> to Lucene before this project gets very far.  Something more than that
> would be required in order to keep old versions, but this could be as
> simple as a Directory subclass that refuses to remove files for a time.

Or a snapshot (hard links) mechanism.
Lucene would also need a way to open a specific index version (rather
than just the latest), but I guess that could also be hacked into
Directory by hiding later "segments" files (assumes lockless is
committed).
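A minimal sketch of the "hide later segments files" idea, for illustration only. It is deliberately not a real Directory subclass: PinnedVersionFilter and its methods are hypothetical names, and it assumes the lockless naming scheme writes commit files as "segments_N" with N being the generation in base 36. It only filters a file listing so that a reader pinned to one generation never sees a newer commit.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Lucene: filters a file listing down to one index version. */
final class PinnedVersionFilter {
    private static final String PREFIX = "segments_";
    private final long pinnedGeneration;

    PinnedVersionFilter(long pinnedGeneration) {
        this.pinnedGeneration = pinnedGeneration;
    }

    /** Returns the generation encoded in a "segments_N" name, or -1 for any other file. */
    static long generationOf(String fileName) {
        if (!fileName.startsWith(PREFIX) || fileName.length() == PREFIX.length()) {
            return -1;
        }
        try {
            // Assumption: N is the commit generation written in base 36.
            return Long.parseLong(fileName.substring(PREFIX.length()), Character.MAX_RADIX);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    /** Hides every "segments_N" file newer than the pinned generation; all other files pass through. */
    List<String> visibleFiles(String[] allFiles) {
        List<String> visible = new ArrayList<>();
        for (String name : allFiles) {
            if (generationOf(name) <= pinnedGeneration) {
                visible.add(name);
            }
        }
        return visible;
    }
}
```

A wrapper along these lines would sit in whatever method lists the directory contents. It only helps if the files referenced by the pinned commit are still on disk, which is where the delayed-delete or hard-link snapshot comes in.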

> > It's unfortunate the master needs to be involved on every document add.
>
> That should not normally be the case.

Ahh... I had assumed that "id" in the following method was document id:
  IndexLocation getUpdateableIndex(String id);

I see now it's index id.

But what is index id exactly?  Looking at the example API you laid
down, it must be a single physical index (as opposed to a logical
index).  In which case, is it entirely up to the client to manage
multi-shard indices?  For example, if we had a "photo" index broken
up into 3 shards, each shard would have a separate index id and it
would be up to the client to know this, and to query across the
different "photo0", "photo1", "photo2" indices.  The master would
have no clue those indices were related.  Hmmm, that doesn't work
very well for deletes though.

It seems like there should be the concept of a logical index, that is
composed of multiple shards, and each shard has multiple copies.
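As a rough sketch of that model (all names here are hypothetical and not part of the proposed API), a logical index could own its shard ids and track the copies of each shard, so that the master rather than the client knows "photo0", "photo1" and "photo2" belong together:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical model: one logical index, made of shards, each shard held by several copies. */
final class LogicalIndex {
    private final String name;
    // replicasByShard.get(i) lists the locations (e.g. host names) holding a copy of shard i.
    private final List<List<String>> replicasByShard = new ArrayList<>();

    LogicalIndex(String name, int shardCount) {
        this.name = name;
        for (int i = 0; i < shardCount; i++) {
            replicasByShard.add(new ArrayList<String>());
        }
    }

    /** Registers one more copy of the given shard. */
    void addReplica(int shard, String location) {
        replicasByShard.get(shard).add(location);
    }

    /** Read-only view of the copies of one shard. */
    List<String> replicasOf(int shard) {
        return Collections.unmodifiableList(replicasByShard.get(shard));
    }

    /** Physical index ids of all shards, e.g. "photo0", "photo1", "photo2". */
    List<String> shardIds() {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < replicasByShard.size(); i++) {
            ids.add(name + i);
        }
        return ids;
    }
}
```

With something like this on the master, a delete against "photo" could be fanned out to every id returned by shardIds() without the client having to know how the index is split.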

Or were you thinking that a cluster would only contain a single
logical index, and hence all different index ids are simply different
shards of that single logical index?  That would seem to be consistent
with ClientToMasterProtocol.getSearchableIndexes() lacking an id
argument.

> I was not imagining a real-time system, where the next query after a
> document is added would always include that document.  Is that a
> requirement?  That's harder.

Not real-time, but it would be nice if we kept it close to what Lucene
can currently provide.
Most people seem fine with a latency of minutes.

> At this point I'm mostly trying to see if this functionality would meet
> the needs of Solr, Nutch and others.
>

It depends on the project scope and how extensible things are.
It seems like the master would be a WAR, capable of running stand-alone.
What about index servers (slaves)?  Would this project include just
the interfaces to be implemented by Solr/Nutch nodes, some common
implementation code behind the interfaces in the form of a library, or
also complete standalone WARs?

I'd need to be able to extend the ClientToSlave protocol to add
additional methods for Solr (for passing in extra parameters and
returning various extra data such as facets, highlighting, etc).
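To illustrate the kind of extension meant here, a sketch under loud assumptions: the ClientToSlaveProtocol below is a stand-in whose search method is invented for the example (the real method set is not spelled out in this thread), and ExtendedResult and ExtendedClientToSlaveProtocol are hypothetical names.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Stand-in for the proposed client-to-slave interface; this method is invented for illustration. */
interface ClientToSlaveProtocol {
    List<String> search(String indexId, String query, int maxHits);
}

/** Hypothetical result type carrying extra per-request data alongside the hits. */
final class ExtendedResult {
    final List<String> hits;
    final Map<String, Integer> facetCounts;

    ExtendedResult(List<String> hits, Map<String, Integer> facetCounts) {
        this.hits = hits;
        this.facetCounts = facetCounts;
    }
}

/** Hypothetical extension: the base protocol plus an overload that takes extra parameters. */
interface ExtendedClientToSlaveProtocol extends ClientToSlaveProtocol {
    ExtendedResult search(String indexId, String query, int maxHits, Map<String, String> params);

    /** The base call is the extended call with no extra parameters, returning only the hits. */
    @Override
    default List<String> search(String indexId, String query, int maxHits) {
        return search(indexId, query, maxHits, Collections.<String, String>emptyMap()).hits;
    }
}
```

The point is only that a node implementing the extended interface should still be usable through the common one, so generic clients and the master do not need to know about the Solr-specific additions.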

> Must we include a notion of document identity and/or document version in
> the mechanism? Would that facilitate updates and coherency?

I don't think it needs to be in the interfaces, so it depends on the
scope of the index server implementations.

-Yonik
