hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Update of "DistributedLucene" by MarkButler
Date Wed, 19 Dec 2007 15:15:44 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by MarkButler:
http://wiki.apache.org/lucene-hadoop/DistributedLucene

------------------------------------------------------------------------------
  
  ==== 1. Broadcasts versus IPC ====
  
- Currently Hadoop does not support broadcasts, and there are problems getting broadcasts
to work across clusters. Do we need to use broadcasts or can we use the same approach as HDFS
and Hbase?
+ In the previous design discussion, there was talk of using broadcasts. However, Hadoop
does not support broadcasts, and there are problems getting broadcasts to work across clusters.
Do we need to use broadcasts, or can we use the same approach as HDFS and HBase?
  
- Current approach: does not use broadcasts. 
+ Current approach: do not use broadcasts. 
  
  ==== 2. How do searches work? ====
  
- '''Searches could be broadcast to one index with each id and return merged results. The
client will load-balance both searches and updates.'''
+ ''Searches could be broadcast to one index with each id and return merged results. The client
will load-balance both searches and updates.''
  
- Current approach: Sharding is implemented in client API. Currently the master and the workers
know nothing about shards. If a client needs to shard an index, it needs to take responsibility,
perhaps using some convention to store the index name and shard as a composite value in the
index ID. To perform a search, the client gets a list of all indexes, then selects replicas
at random to query to perform load balancing. The workers return results and the client API
aggregates them. 
+ Current approach: The master and the workers know nothing about shards, so sharding needs
to be implemented in the client API. One way to do this is to adopt a convention that stores
the index name and shard number as a composite value in the index ID, for example
+ {{{
+ myindex-1
+ myindex-2
+ myindex-3
+ }}} 
+ To perform a search in the non-sharded case, the client gets a list of all index locations,
identifies the set of replicas it could query, and selects one at random, performing load
balancing in the process. 
+ 
+ In the sharded case, it does the same thing but queries one replica of each shard. The workers
return results and the client API aggregates them. 
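
The composite-ID convention and the per-shard replica selection above can be sketched as follows. This is purely illustrative client-side logic under the assumed `<name>-<shard>` naming scheme; the method names are not part of the actual DistributedLucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShardedSearchSketch {

    // Extract the logical index name from a composite ID like "myindex-2".
    static String indexName(String compositeId) {
        int dash = compositeId.lastIndexOf('-');
        return dash < 0 ? compositeId : compositeId.substring(0, dash);
    }

    // Group index IDs by logical index, so a sharded search can query one
    // replica of each shard and merge the results client-side.
    static Map<String, List<String>> shardsByIndex(List<String> indexIds) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String id : indexIds) {
            grouped.computeIfAbsent(indexName(id), k -> new ArrayList<>()).add(id);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("myindex-1", "myindex-2", "myindex-3");
        System.out.println(shardsByIndex(ids));
        // {myindex=[myindex-1, myindex-2, myindex-3]}
    }
}
```

In the sharded case the client would query one replica from each group and merge the hit lists; in the non-sharded case each group has a single entry.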
  
  ==== 3. How do deletions work? ====
  
@@ -106, +114 @@

  
  Good point. Either the two clients must coordinate, to make sure that they're not updating
the same document at the same time, or use a strategy where updates are routed to the slave
that contained the old version of the document. That would require a broadcast query to figure
out which slave that is.''
  
+ This needs some discussion. Currently the implementation uses a versioning approach: when
a client starts to make changes to an index, the relevant worker copies the index, increments
the version number, and sets the IndexState to UNCOMMITTED. The client can then add and delete
documents or add a remote index. When the client commits the index, it becomes available to
be searched and replicated. 
+ 
+ Therefore there is a danger that two clients could edit the same index at the same time.
One possibility here would be to bind a particular version of an index to a particular client.
If the client fails, that is not a problem; the changes are simply never committed. However,
there is still a danger of a race condition in which two different branches of the same index
are created.
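
The copy-on-write versioning described above might be sketched like this. The state names and methods are assumptions for illustration, not the real classes:

```java
public class VersionedIndexSketch {
    enum IndexState { LIVE, UNCOMMITTED }

    final String id;
    final int version;
    IndexState state;

    VersionedIndexSketch(String id, int version, IndexState state) {
        this.id = id;
        this.version = version;
        this.state = state;
    }

    // A client starting an edit gets a copy at version + 1, marked
    // UNCOMMITTED so it is neither searched nor replicated yet.
    VersionedIndexSketch startEdit() {
        return new VersionedIndexSketch(id, version + 1, IndexState.UNCOMMITTED);
    }

    // Committing publishes the new version: it becomes searchable and is
    // picked up for replication at the next heartbeat.
    void commit() {
        state = IndexState.LIVE;
    }

    public static void main(String[] args) {
        VersionedIndexSketch v1 = new VersionedIndexSketch("myindex", 1, IndexState.LIVE);
        VersionedIndexSketch v2 = v1.startEdit(); // private uncommitted copy
        v2.commit();                              // v2 now LIVE; v1 untouched
        System.out.println(v2.version + " " + v2.state);
    }
}
```

Note the race: nothing in this sketch stops two clients from each calling startEdit() on the same version and committing divergent branches, which is exactly the hazard described above.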

+ 
  ==== 5. How do additions work? ====
  
  The master should not be involved in adds. Clients can cache the set of writable index locations
and directly submit new documents without involving the master.
  
  The master should be out of the loop as much as possible. One approach is that clients randomly
assign documents to indexes and send the updates directly to the indexing node. '''Alternately,
clients might index locally, then ship the updates to a node packaged as an index. That was
the intent of the addIndex method.'''
  
+ TODO: Think how to do this in the client API. 
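
One possible shape for this in the client API, under the assumption that the client simply caches writable locations and assigns documents at random as suggested above (the names here are hypothetical):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class AddPathSketch {

    // Choose a writable location uniformly at random for the next document,
    // keeping the master out of the add path entirely.
    static String pickTarget(List<String> writableLocations, Random rnd) {
        return writableLocations.get(rnd.nextInt(writableLocations.size()));
    }

    public static void main(String[] args) {
        // Cached set of writable index locations, refreshed periodically
        // from the master rather than per document.
        List<String> cached = Arrays.asList("worker1:myindex-1", "worker2:myindex-2");
        String target = pickTarget(cached, new Random());
        System.out.println("sending document to " + target);
    }
}
```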
+ 
  ==== 6. How do commits work? ====
  
  ''It seems like the master might want to be involved in commits too, or maybe we just rely
on the slave to master heartbeat to kick off immediately after a commit so that index replication
can be initiated? I like the latter approach. New versions are only published as frequently
as clients poll the master for updated IndexLocations. Clients keep a cache of both readable
and updatable index locations that are periodically refreshed.''
  
+ Current approach: When an index is committed, replication waits until the next heartbeat
to begin. It might be possible to bring this forward, but realistically I cannot see any
advantage in doing so. 
+ 
+ There are probably some subtleties about heartbeat mechanisms on large clusters. For example,
staggered heartbeats are good (no collisions), whereas synchronized heartbeats are bad
(collisions). Does the system settle to a steady state? If so, changing heartbeat intervals
could cause problems. I'm just guessing here though. The main point is that this is a complex
system, so doing things that seem obvious can sometimes have unexpected (and possibly detrimental)
effects. It's important to do the math to check that the likely result is what you expect. 
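
The collision concern can be shown with a toy model: N workers heartbeat every `interval` ticks; with a shared phase every heartbeat lands in the same slot, while staggered phases spread the peak load to roughly N / interval. This is illustrative arithmetic only, not Hadoop code:

```java
public class HeartbeatSketch {

    // Maximum number of heartbeats arriving in any single tick over one
    // full interval, for workers that all share a phase (synchronized)
    // versus workers whose phases are spread evenly (staggered).
    static int peakPerTick(int workers, int interval, boolean staggered) {
        int[] slots = new int[interval];
        for (int w = 0; w < workers; w++) {
            int phase = staggered ? w % interval : 0;
            slots[phase]++;
        }
        int peak = 0;
        for (int s : slots) {
            peak = Math.max(peak, s);
        }
        return peak;
    }

    public static void main(String[] args) {
        System.out.println(peakPerTick(100, 10, false)); // 100: all collide
        System.out.println(peakPerTick(100, 10, true));  // 10: evenly spread
    }
}
```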
+ 
  ==== 7. Finding updateable indexes ====
  
  ''Looking at''
@@ -123, +141 @@

         IndexLocation[] getUpdateableIndex(String[] id);
  }}}
  ''I'd assumed that the updateable version of an index does not move around very often. Perhaps
a lease mechanism is required. For example, a call to getUpdateableIndex might be valid for
ten minutes.''
+ 
+ It's hard to give guarantees here as nodes can fail at any time ...
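
The lease idea quoted above could look like the following sketch. A lease cannot guard against node failure, as noted; it only bounds how stale a cached location can get before the client must re-ask the master. The class and field names are assumed:

```java
public class LeaseSketch {
    // Validity window for a getUpdateableIndex answer, e.g. ten minutes.
    static final long LEASE_MS = 10 * 60 * 1000L;

    final String location;
    final long grantedAtMs;

    LeaseSketch(String location, long grantedAtMs) {
        this.location = location;
        this.grantedAtMs = grantedAtMs;
    }

    // Valid while inside the window; afterwards, refresh from the master.
    boolean valid(long nowMs) {
        return nowMs - grantedAtMs < LEASE_MS;
    }

    public static void main(String[] args) {
        LeaseSketch lease = new LeaseSketch("worker1:myindex-1", 0L);
        System.out.println(lease.valid(5 * 60 * 1000L));  // true: 5 min in
        System.out.println(lease.valid(11 * 60 * 1000L)); // false: expired
    }
}
```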
  
  ==== 8. What is an Index ID? ====
  
@@ -133, +153 @@

  
  Or were you thinking that a cluster would only contain a single logical index, and hence
all different index ids are simply different shards of that single logical index?  That would
seem to be consistent with ClientToMasterProtocol .getSearchableIndexes() lacking an id argument.''
  
+ This comes back to how we implement sharding discussed above ...
+ 
  ==== 9. What about SOLR? ====
  
  ''It depends on the project scope and how extensible things are. It seems like the master
would be a WAR, capable of running stand-alone. What about index servers (slaves)?  Would
this project include just the interfaces to be implemented by Solr/Nutch nodes, some common
implementation code behind the interfaces in the form of a library, or also complete standalone
WARs?
@@ -142, +164 @@

  ==== 10. How does versioning work? ====
  
  ''Could this be done in Lucene? It would also need a way to open a specific index version
(rather than just the latest), but I guess that could also be hacked into Directory by hiding
later "segments" files (assumes lockless is committed).''
+ 
+ Current approach: At the moment, it just copies the whole index for each new version, which
is going to be expensive in disk space. But when it replicates indexes, it copies only recent
changes if a local copy of the index already exists, so network traffic should be efficient. 
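
The incremental replication noted above relies on Lucene index files being write-once: a replica that already holds an older copy only needs to fetch the files it is missing. A sketch of that file-set difference (the method name and file names are illustrative):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ReplicationSketch {

    // Return the remote files absent from the local copy; only these need
    // to cross the network, since existing index files never change.
    static List<String> filesToFetch(Set<String> remoteFiles, Set<String> localFiles) {
        List<String> missing = new ArrayList<>();
        for (String f : remoteFiles) {
            if (!localFiles.contains(f)) {
                missing.add(f);
            }
        }
        Collections.sort(missing);
        return missing;
    }

    public static void main(String[] args) {
        Set<String> remote = new HashSet<>(List.of("_0.cfs", "_1.cfs", "segments_2"));
        Set<String> local = new HashSet<>(List.of("_0.cfs", "segments_1"));
        System.out.println(filesToFetch(remote, local)); // [_1.cfs, segments_2]
    }
}
```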
  
  === Mark's comments ===
  
