hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Trivial Update of "DistributedLucene" by MarkButler
Date Wed, 19 Dec 2007 11:48:53 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by MarkButler:
http://wiki.apache.org/lucene-hadoop/DistributedLucene

------------------------------------------------------------------------------
  
  === Issues to be discussed ===
  
- 1. How do searches work?
+ ==== 1. How do searches work? ====
  
  Searches could be broadcast to one index server for each index id, with the merged results
returned to the client. The client will load-balance both searches and updates.
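  
  A rough sketch of what that scatter-gather could look like on the client side (the SlaveStub and Hit types are invented for illustration; only the broadcast-and-merge behaviour comes from the text above):
  {{{
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical client-side scatter-gather: query one replica of every
// index id in parallel, then merge all hits into one score-sorted list.
public class BroadcastSearcher {

  /** Assumed RPC stub for one index server; not part of the proposal. */
  public interface SlaveStub {
    List<Hit> search(String indexId, String query, int n) throws Exception;
  }

  public static class Hit {
    public final String docId;
    public final float score;
    public Hit(String docId, float score) { this.docId = docId; this.score = score; }
  }

  private final Map<String, SlaveStub> replicaByIndexId; // one chosen replica per index id
  private final ExecutorService pool = Executors.newCachedThreadPool();

  public BroadcastSearcher(Map<String, SlaveStub> replicaByIndexId) {
    this.replicaByIndexId = replicaByIndexId;
  }

  public List<Hit> search(final String query, final int n) throws Exception {
    List<Future<List<Hit>>> futures = new ArrayList<Future<List<Hit>>>();
    for (final Map.Entry<String, SlaveStub> e : replicaByIndexId.entrySet()) {
      futures.add(pool.submit(new Callable<List<Hit>>() {   // scatter
        public List<Hit> call() throws Exception {
          return e.getValue().search(e.getKey(), query, n);
        }
      }));
    }
    List<Hit> merged = new ArrayList<Hit>();
    for (Future<List<Hit>> f : futures) merged.addAll(f.get()); // gather
    Collections.sort(merged, new Comparator<Hit>() {            // best first
      public int compare(Hit a, Hit b) { return Float.compare(b.score, a.score); }
    });
    return merged.subList(0, Math.min(n, merged.size()));
  }
}
  }}}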
  
- 2. How do deletions work?
+ ==== 2. How do deletions work? ====
  
  Deletions could be broadcast to all slaves. That would probably be fast enough. Alternatively,
indexes could be partitioned by a hash of each document's unique id, permitting deletions
to be routed to the appropriate slave.
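  
  A minimal sketch of that hash routing (the DeleteRouter class and its slave address table are assumptions, not part of the proposal):
  {{{
// Hypothetical hash routing: with indexes partitioned by document id,
// a delete goes to exactly one slave instead of being broadcast to all.
public class DeleteRouter {
  private final String[] slaveForPartition; // partition number -> slave address

  public DeleteRouter(String[] slaveForPartition) {
    this.slaveForPartition = slaveForPartition;
  }

  /** Stable, non-negative partition derived from the document's unique id. */
  public int partitionFor(String docId) {
    return (docId.hashCode() & Integer.MAX_VALUE) % slaveForPartition.length;
  }

  public String slaveFor(String docId) {
    return slaveForPartition[partitionFor(docId)];
  }
}
  }}}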
  
- 3. How does update work?
+ ==== 3. How does update work? ====
  
  The master should be out of the loop as much as possible. One approach is that clients randomly
assign documents to indexes and send the updates directly to the indexing node. Alternatively,
clients might index locally, then ship the updates, packaged as an index, to a node. That was
the intent of the addIndex method.
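  
  A sketch of the index-locally-then-ship approach, using the real Lucene 2.x RAMDirectory and IndexWriter but an assumed signature for addIndex, which the proposal names without defining:
  {{{
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

// Hypothetical client that builds a complete index locally and ships it
// to a slave in one addIndex call, instead of sending per-document adds.
public class LocalIndexShipper {

  /** Assumed transport; the proposal names addIndex but not its signature. */
  public interface SlaveStub {
    void addIndex(String indexId, RAMDirectory packagedIndex) throws Exception;
  }

  public void indexAndShip(SlaveStub slave, String indexId,
                           Iterable<Document> docs) throws Exception {
    RAMDirectory local = new RAMDirectory();
    IndexWriter writer = new IndexWriter(local, new StandardAnalyzer(), true);
    for (Document d : docs) {
      writer.addDocument(d);        // index locally, away from any slave
    }
    writer.close();                 // finish the local index
    slave.addIndex(indexId, local); // ship it to the indexing node as a unit
  }
}
  }}}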
  
@@ -96, +96 @@

  
  Good point. Either the two clients must coordinate, to make sure that they're not updating
the same document at the same time, or updates must be routed to the slave that contains the
old version of the document. The latter would require a broadcast query to figure out which
slave that is.
  
- 4. How do additions work?
+ ==== 4. How do additions work? ====
  
  The master should not be involved in adds. Clients can cache the set of writable index locations
and directly submit new documents without involving the master.
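  
  A sketch of such a client-side cache (the MasterStub method name and the refresh interval are assumptions; the proposal only says clients cache writable index locations):
  {{{
import java.util.List;

// Hypothetical cache of writable index locations, refreshed from the
// master only when stale, so document adds normally skip the master.
public class WritableIndexCache {

  /** Assumed RPC stub; the real ClientToMasterProtocol is not spelled out here. */
  public interface MasterStub {
    List<String> getWritableIndexLocations() throws Exception;
  }

  private final MasterStub master;
  private final long ttlMillis;
  private List<String> locations;
  private long fetchedAt;

  public WritableIndexCache(MasterStub master, long ttlMillis) {
    this.master = master;
    this.ttlMillis = ttlMillis;
  }

  /** Returns cached locations, consulting the master only on expiry. */
  public synchronized List<String> get() throws Exception {
    long now = System.currentTimeMillis();
    if (locations == null || now - fetchedAt > ttlMillis) {
      locations = master.getWritableIndexLocations();
      fetchedAt = now;
    }
    return locations;
  }
}
  }}}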
  
  
- 5. How do commits work?
+ ==== 5. How do commits work? ====
  
  It seems like the master might want to be involved in commits too, or maybe we just rely
on the slave-to-master heartbeat to kick off immediately after a commit so that index replication
can be initiated? I like the latter approach. New versions are only published as frequently
as clients poll the master for updated IndexLocations. Clients keep a cache of both readable
and updatable index locations that are periodically refreshed.
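  
  One reading of that heartbeat approach, sketched with invented names (the proposal does not define the heartbeat payload):
  {{{
// Hypothetical slave-side hook: after a commit, send a heartbeat at once
// so the master can schedule replication without a separate commit RPC.
public class CommitNotifier {

  /** Assumed RPC stub; payload and method name are illustrative only. */
  public interface MasterStub {
    void heartbeat(String slaveId, String indexId, long committedVersion);
  }

  private final MasterStub master;
  private final String slaveId;

  public CommitNotifier(MasterStub master, String slaveId) {
    this.master = master;
    this.slaveId = slaveId;
  }

  public void onCommit(String indexId, long newVersion) {
    // Heartbeats normally fire on a timer; a commit just fires one early,
    // carrying the new version so the master can initiate replication.
    master.heartbeat(slaveId, indexId, newVersion);
  }
}
  }}}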
  
- 6. Broadcasts versus IPC
+ ==== 6. Broadcasts versus IPC ====
  
  Currently Hadoop does not support broadcasts, and there are problems getting broadcasts
to work across clusters. Do we need to use broadcasts, or can we use the same approach as HDFS
and HBase?
  
- 7. Finding updateable indexes
+ ==== 7. Finding updateable indexes ====
  
  Looking at 
  {{{
@@ -118, +118 @@

  }}}
  I'd assumed that the updateable version of an index does not move around very often. Perhaps
a lease mechanism is required. For example, a call to getUpdateableIndex might be valid for
ten minutes.
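  
  A sketch of what such a lease might look like (the IndexLease class is invented; only getUpdateableIndex and the ten-minute validity come from the text above):
  {{{
// Hypothetical lease: getUpdateableIndex would hand one of these back,
// and the client re-asks the master only after the lease expires.
public class IndexLease {
  private final String indexLocation;   // where updates should be sent
  private final long expiresAtMillis;   // issue time + lease duration

  public IndexLease(String indexLocation, long leaseMillis) {
    this.indexLocation = indexLocation;
    this.expiresAtMillis = System.currentTimeMillis() + leaseMillis;
  }

  /** e.g. new IndexLease(loc, 10 * 60 * 1000L) for a ten-minute lease. */
  public boolean stillValid() {
    return System.currentTimeMillis() < expiresAtMillis;
  }

  public String location() { return indexLocation; }
}
  }}}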
  
- === Reply from Doug ===
+ ==== 8. What is an Index ID? ====
  
- That should not normally be the case. Clients can cache the set of writable index locations
and directly submit new documents without involving the master.
+ But what is an index id exactly?  Looking at the example API you laid down, it must be a single physical index (as opposed to a logical index).  In which case, is it entirely up to the client to manage multi-shard indices?  For example, if we had a "photo" index broken up into 3 shards, each shard would have a separate index id and it would be up to the client to know this, and to query across the different "photo0", "photo1", "photo2" indices.  The master would have no clue those indices were related.  Hmmm, that doesn't work very well for deletes though.
  
+ It seems like there should be the concept of a logical index that is composed of multiple shards, where each shard has multiple copies.
-     If deletes were broadcast, and documents could go to any partition,
-     that would be one way around it (with the downside of a less powerful
-     master that could implement certain distribution policies).
-     Another way to lessen the master-in-the-middle cost is to make sure
-     one can aggregate small requests:
-        IndexLocation[] getUpdateableIndex(String[] id);
  
- I'd assumed that the updateable version of an index does not move around very often. Perhaps
a lease mechanism is required. For example, a call to getUpdateableIndex might be valid for
ten minutes.
+ Or were you thinking that a cluster would only contain a single logical index, and hence all different index ids are simply different shards of that single logical index?  That would seem to be consistent with ClientToMasterProtocol.getSearchableIndexes() lacking an id argument.
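  
  A sketch of the master-side model this logical-index reading implies (entirely hypothetical; neither class name appears in the proposal):
  {{{
import java.util.List;
import java.util.Map;

// Hypothetical master-side model: one logical index ("photo") maps to
// shards ("photo0".."photo2"), each with several replica locations.
public class LogicalIndex {
  private final String name;                                // e.g. "photo"
  private final Map<String, List<String>> replicasByShard;  // shard id -> slave addresses

  public LogicalIndex(String name, Map<String, List<String>> replicasByShard) {
    this.name = name;
    this.replicasByShard = replicasByShard;
  }

  // Deletes can now be fanned out to every shard of the logical index
  // without the client having to know the "photo0".."photo2" naming.
  public Iterable<String> shardIds() { return replicasByShard.keySet(); }

  public List<String> replicasOf(String shardId) { return replicasByShard.get(shardId); }

  public String name() { return name; }
}
  }}}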
  
-     We might consider a delete() on the master interface too. That way it could
+ ==== 9. What about SOLR? ====
  
+ It depends on the project scope and how extensible things are. It seems like the master
would be a WAR, capable of running stand-alone. What about index servers (slaves)?  Would
this project include just the interfaces to be implemented by Solr/Nutch nodes, some common
implementation code behind the interfaces in the form of a library, or also complete standalone
WARs?
-     1) simply do the delete locally if there is a single index partition
-     and this is a combination master/searcher
-     2) potentially do some batching of deletes
-     3) hide the delete policy (broadcast or direct-to-server-that-has-doc)
  
- I'm reticent to put any frequently-made call on the master. I'd prefer to keep the master
only involved at an executive level, with all per-document and per-query traffic going directly
from client to slave.
+ I'd need to be able to extend the ClientToSlave protocol to add additional methods for Solr
(for passing in extra parameters and returning various extra data such as facets, highlighting,
etc).
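  
  A sketch of that extension point (the base ClientToSlave method is invented for illustration; only the protocol's name and the idea of Solr-specific parameters and results come from the text above):
  {{{
import java.util.List;
import java.util.Map;

// Hypothetical layering: a Solr node implements a sub-interface of the
// common ClientToSlave protocol, adding its own parameters and results.
public interface ClientToSlave {
  // Assumed common search call shared by all index servers.
  List<String> search(String indexId, String query, int n);
}

interface SolrClientToSlave extends ClientToSlave {
  // Solr-only call: free-form parameters in, extra data such as
  // facets and highlighting out (illustrative signature).
  Map<String, Object> searchWithParams(String indexId, Map<String, String> params);
}
  }}}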
  
+ ==== 10. How does versioning work? ====
-     It seems like the master might want to be involved in commits too, or
-     maybe we just rely on the slave to master heartbeat to kick off
-     immediately after a commit so that index replication can be initiated?
  
+ Could this be done in Lucene? It would also need a way to open a specific index version
(rather than just the latest), but I guess that could also be hacked into Directory by hiding
later "segments" files (assumes lockless is committed).
- I like the latter approach. New versions are only published as frequently as clients poll
the master for updated IndexLocations. Clients keep a cache of both readable and updatable
index locations that are periodically refreshed.
- 
- I was not imagining a real-time system, where the next query after a document is added would
always include that document. Is that a requirement? That's harder.
- 
- At this point I'm mostly trying to see if this functionality would meet the needs of Solr,
Nutch and others.
- 
- Must we include a notion of document identity and/or document version in
- the mechanism? Would that facilitate updates and coherency?
- 
- In Nutch a typical case is that you have a bunch of URLs with content that may-or-may-not have been previously indexed. The approach I'm currently leaning towards is that we'd broadcast the deletions of all of these to all slaves, then index them into randomly assigned indexes. In Nutch multiple clients would naturally be coordinated, since each url is represented only once in each update cycle.
- 
- === Reply from Yonik ===
- 
- On 10/30/06, Doug Cutting <[EMAIL PROTECTED]> wrote:
- 
-     Yonik Seeley wrote:
-     > On 10/18/06, Doug Cutting <[EMAIL PROTECTED]> wrote:
-     >> We assume that, within an index, a file with a given name is written
-     >> only once.
-     >
-     > Is this necessary, and will we need the lockless patch (that avoids
-     > renaming or rewriting *any* files), or is Lucene's current index
-     > behavior sufficient?
- 
-     It's not strictly required, but it would make index synchronization a
-     lot simpler. Yes, I was assuming the lockless patch would be committed
-     to Lucene before this project gets very far.  Something more than that
-     would be required in order to keep old versions, but this could be as
-     simple as a Directory subclass that refuses to remove files for a time.
- 
- Or a snapshot (hard links) mechanism.
- Lucene would also need a way to open a specific index version (rather
- than just the latest), but I guess that could also be hacked into
- Directory by hiding later "segments" files (assumes lockless is
- committed).
- 
- 
-     > It's unfortunate the master needs to be involved on every document add.
- 
-     That should not normally be the case.
- 
- Ahh... I had assumed that "id" in the following method was document id:
-  IndexLocation getUpdateableIndex(String id);
- 
- I see now it's index id.
- 
- But what is index id exactly?  Looking at the example API you laid
- down, it must be a single physical index (as opposed to a logical
- index).  In which case, is it entirely up to the client to manage
- multi-shard indices?  For example, if we had a "photo" index broken
- up into 3 shards, each shard would have a separate index id and it
- would be up to the client to know this, and to query across the
- different "photo0", "photo1", "photo2" indices.  The master would
- have no clue those indices were related.  Hmmm, that doesn't work
- very well for deletes though.
- 
- It seems like there should be the concept of a logical index, that is
- composed of multiple shards, and each shard has multiple copies.
- 
- Or were you thinking that a cluster would only contain a single
- logical index, and hence all different index ids are simply different
- shards of that single logical index?  That would seem to be consistent
- with ClientToMasterProtocol .getSearchableIndexes() lacking an id
- argument.
- 
- 
-     I was not imagining a real-time system, where the next query after a
-     document is added would always include that document.  Is that a
-     requirement?  That's harder.
- 
- Not real-time, but it would be nice if we kept it close to what Lucene
- can currently provide.
- Most people seem fine with a latency of minutes.
- 
- 
-     At this point I'm mostly trying to see if this functionality would meet
-     the needs of Solr, Nutch and others.
- 
- 
- It depends on the project scope and how extensible things are.
- It seems like the master would be a WAR, capable of running stand-alone.
- What about index servers (slaves)?  Would this project include just
- the interfaces to be implemented by Solr/Nutch nodes, some common
- implementation code behind the interfaces in the form of a library, or
- also complete standalone WARs?
- 
- I'd need to be able to extend the ClientToSlave protocol to add
- additional methods for Solr (for passing in extra parameters and
- returning various extra data such as facets, highlighting, etc).
- 
- 
-     Must we include a notion of document identity and/or document version in
-     the mechanism? Would that facilitate updates and coherency?
- 
- It doesn't need to be in the interfaces I don't think, so it depends
- on the scope of the index server implementations.
- 
  
  === Mark's comments ===
- 
  
  DLucene does not use HDFS directly, but it is heavily inspired by HDFS. This is because
the files used in Lucene indexes are quite different from the files that HDFS was designed
for. DLucene uses a similar replication algorithm and, where possible, HDFS code, although
it was necessary to make some local changes to the visibility of some classes and methods. 
  
