hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "DistributedLucene" by MarkButler
Date Tue, 02 Jun 2009 13:34:26 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by MarkButler:
http://wiki.apache.org/hadoop/DistributedLucene

------------------------------------------------------------------------------
  = Distributed Lucene =
+ 
+ This work has now been superseded by the Katta project.
+ 
+ Katta project - http://www.sourceforge.net/projects/katta
  
  Doug Cutting's original proposal: http://www.mail-archive.com/general@lucene.apache.org/msg00338.html
  
+ Also see
- Code for this work is now available here:
- https://issues.apache.org/jira/browse/HADOOP-3394
  
- Also see
  Bailey project - http://www.sourceforge.net/projects/bailey
- Katta project - http://www.sourceforge.net/projects/katta
+ 
  Contrib for updating indexes using MapReduce - https://issues.apache.org/jira/browse/HADOOP-2951
  
- === Implementation Notes (Obsolete, retained for comments) ===
- 
- === Current Status ===
- 
- Currently there is an alpha implementation of the design outlined above, specifically the master, worker, client library and unit tests.
- 
- The implementation (DLucene) does not store its indexes in HDFS, but it is heavily inspired by HDFS. This is because the files used in Lucene indexes are quite different from the files that HDFS was designed for. It uses a replication algorithm similar to HDFS's, and where possible it reuses HDFS code, although it was necessary to make some local changes to the visibility of some classes and methods.
- 
- Unlike HDFS, it currently uses a stateless master: the heartbeat information sent by each worker contains a list of all the indexes it owns and their current status. This means it should be possible to swap over to a new master after a failure. The disadvantage is that this results in more network traffic per heartbeat.
- 
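For illustration, the state a worker reports in each heartbeat might look something like the sketch below. The class and field names are hypothetical, not the actual DLucene types; the point is that the payload has to be enough for a fresh master to rebuild its whole view of the cluster.

{{{
// Hypothetical sketch of a heartbeat payload from worker to master.
// Because the master is stateless, this must describe every index replica
// the worker holds, which is why heartbeats grow with the number of indexes.
public class WorkerHeartbeat {
  public String workerAddress;        // host:port of the worker
  public long freeDiskSpace;          // hint for the replication planner
  public IndexStatus[] indexes;       // one entry per local index replica

  public static class IndexStatus {
    public String indexId;            // e.g. "myindex-2"
    public int version;               // latest version held locally
    public String state;              // e.g. "LIVE" or "UNCOMMITTED"
  }
}
}}}
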
- Both the master and the workers have a heartbeat architecture. Workers have three types of threads: one to service requests, one to send heartbeats to the master to inform it that the worker is alive, and one to process replication tasks. The master has two types of threads: one to service requests, the other to perform failure detection and compute a replication plan. A segment of this plan is then sent back to each worker in response to its heartbeat.
- 
- One of the aims of the implementation is to better understand how Hadoop works, so that
it is possible to create an architecture to simplify the creation of other specialized storage
or processing components for Hadoop. 
- 
- There are also a number of outstanding items of functionality:
- 
-    * There is no process to delete old versions of indexes after a predetermined time, as
in HDFS. 
-    * The implementation does not take advantage of Lucene's RAM based indexes to improve efficiency.
-    * In HDFS there is a "throttler" to control client requests. There is no equivalent functionality in DLucene.
-    * A cluster contains a number of replicas of an index, to support high availability and load balancing. When a client writes to an index, the updates are sent to one replica, which creates a new version. When the changes are committed, the new version is then propagated to the other machines in the cluster. However, if several clients update the same index at the same time, their updates need to be synchronized on the same replica, and I haven't worked out a way of doing this yet. See below for discussion of some design alternatives.
-    * Replication is quicker when a worker already has an old copy of an index, and the copy mechanism takes advantage of this, but the replication assignment algorithm does not yet prefer such workers when placing replicas.

-    * No benchmarks yet.
- 
- === Design decisions ===
- 
- ==== 1. Broadcasts versus IPC ====
- 
- In the original design discussion by Doug Cutting and Yonik Seeley, broadcasts were considered
for deletes and searches. The current implementation uses basic Hadoop IPC rather than broadcasts.

- 
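As a rough sketch of what "basic Hadoop IPC" looks like in practice, the client-to-worker calls could be declared as a protocol interface in the style below. The interface and method names here are invented for illustration (the real protocol interfaces live in the HADOOP-3394 patch), and in the Hadoop IPC of that era the argument and return types would be Writable classes rather than the plain strings used here.

{{{
import java.io.IOException;
import org.apache.hadoop.ipc.VersionedProtocol;

// Sketch of a point-to-point RPC protocol in the Hadoop IPC style.
// Each call goes to exactly one node; there is no broadcast primitive.
public interface ClientToWorkerProtocol extends VersionedProtocol {
  long PROTOCOL_VERSION = 1L;

  // Hypothetical operations a client invokes directly on a single worker.
  String[] search(String indexId, String query, int maxHits) throws IOException;
  void addDocument(String indexId, String document) throws IOException;
  void removeDocuments(String indexId, String idTerm) throws IOException;
}
}}}
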
- ==== 2. Sharding ====
- 
- The master and the workers know nothing about sharding; it is handled entirely by the client. This keeps the master and worker implementations simple.
- 
- ==== 3. How do searches work? ====
- 
- ''Searches could be broadcast to one index with each id and return merged results. The client
will load-balance both searches and updates.''
- 
- Current approach: The master and the workers know nothing about shards, so sharding needs to be implemented in the client API. One way to do this is to adopt a convention that stores the index name and shard number as a composite value in the index ID, for example
- {{{
- myindex-1
- myindex-2
- myindex-3
- }}} 
- Then, to perform a search in the non-sharded case, the client gets a list of all index locations and identifies the set of possible locations it could query. It selects a replica at random to query, in the process performing load balancing.
- 
- In the sharded case, it does the same thing but queries one replica of each shard. The workers
return results and the client API aggregates them. 
- 
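A minimal sketch of the client-side logic, assuming the composite-ID convention above. IndexLocation and getSearchableIndexes() are the names used elsewhere on this page, but their exact shape here, along with SearchHit and searchWorker, is a placeholder rather than the actual client API.

{{{
// Sketch: search a sharded index named "myindex" from the client library.
Map<String, List<IndexLocation>> replicasByShard = new HashMap<String, List<IndexLocation>>();
for (IndexLocation loc : master.getSearchableIndexes()) {    // cached list from the master
  if (loc.getIndexId().startsWith("myindex-")) {             // e.g. "myindex-1", "myindex-2", ...
    if (!replicasByShard.containsKey(loc.getIndexId())) {
      replicasByShard.put(loc.getIndexId(), new ArrayList<IndexLocation>());
    }
    replicasByShard.get(loc.getIndexId()).add(loc);
  }
}

List<SearchHit> merged = new ArrayList<SearchHit>();
Random random = new Random();
for (List<IndexLocation> replicas : replicasByShard.values()) {
  // One replica of each shard, chosen at random: this is where load balancing happens.
  IndexLocation chosen = replicas.get(random.nextInt(replicas.size()));
  merged.addAll(searchWorker(chosen, query));                // one IPC call per shard
}
// Finally, sort "merged" by score before returning it to the caller.
}}}
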
- ==== 4. How do deletions work? ====
- 
- ''Deletions could be broadcast to all slaves. That would probably be fast enough. Alternately,
indexes could be partitioned by a hash of each document's unique id, permitting deletions
to be routed to the appropriate slave.''
- 
- Current approach: On non-sharded indexes, deletions are sent directly to the worker. On
sharded ones, they work like searches described above. 
- 
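For the hash-partitioning alternative mentioned in the quote, routing a delete would be a one-liner in the client, along these lines (a sketch only; deleteFromWorker stands in for the client's actual delete call):

{{{
// Sketch: if documents were assigned to shards by a hash of their unique id,
// a delete would only need to go to the one shard that can hold the document.
int shard = (docId.hashCode() & Integer.MAX_VALUE) % numShards;
String targetIndexId = "myindex-" + (shard + 1);   // reuses the composite-ID convention above
deleteFromWorker(targetIndexId, docId);            // placeholder for the real client call
}}}
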
- ==== 5. How does update work? ====
- 
- ''The master should be out of the loop as much as possible. One approach is that clients
randomly assign documents to indexes and send the updates directly to the indexing node. 
- 
- One potential problem is a document overwrite implemented as a delete then an add. More than one client doing this for the same document could result in 0 or 2 documents, instead of 1. I guess clients will just need to be relatively coordinated in their activities.
- 
- Good point. Either the two clients must coordinate, to make sure that they're not updating
the same document at the same time, or use a strategy where updates are routed to the slave
that contained the old version of the document. That would require a broadcast query to figure
out which slave that is.''
- 
- This needs some discussion. Currently the implementation uses a versioning approach, so
when a client starts to make changes to an index, the relevant worker copies the index, increments
the version number, and sets the IndexState to UNCOMMITTED. The client can then add and delete
documents or add a remote index. When the client commits the index, it becomes available to
be searched and replicated. 
- 
- Therefore there is a danger that two clients could edit the same index at the same time. One possibility here would be to bind a particular version of an index to a particular client. If the client fails, that is not a problem: the changes are simply never committed. However, there is still a danger of a race condition in which two clients create two different branches of the same index.

- 
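Sketched as Java, the version-on-write flow described above looks roughly like the following. Only IndexState and the UNCOMMITTED state come from the description above; the class, method and helper names are illustrative, not the actual implementation.

{{{
// Illustrative sketch of the worker-side versioning flow, not the actual code.
class WorkerUpdateSketch {
  synchronized IndexVersion startUpdate(String indexId) {
    IndexVersion latest = latestCommittedVersion(indexId); // e.g. version 7
    IndexVersion next = copyIndex(latest);                  // physical copy becomes version 8
    next.setState(IndexState.UNCOMMITTED);                  // invisible to searches and replication
    return next;                                            // the client adds/deletes against this copy
  }

  synchronized void commit(IndexVersion version) {
    version.setState(IndexState.LIVE);  // now searchable; replication starts at the next heartbeat
  }

  // The race discussed above: two clients calling startUpdate("myindex-1") at the
  // same time each get their own copy, so two branches of the same version can
  // exist unless an uncommitted version is bound to a single client.
}
}}}
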
- ==== 6. How do additions work? ====
- 
- The master should not be involved in adds. Clients can cache the set of writable index locations
and directly submit new documents without involving the master.
- 
- The master should be out of the loop as much as possible. One approach is that clients randomly
assign documents to indexes and send the updates directly to the indexing node. '''Alternately,
clients might index locally, then ship the updates to a node packaged as an index. That was
the intent of the addIndex method.'''
- 
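For illustration, the client-side pattern might look like the sketch below. The refresh interval, the way getUpdateableIndex is used and the workerFor helper are assumptions made for the sketch, not the real API contract.

{{{
// Sketch: keeping the master out of the write path. The client caches the
// writable index locations and refreshes them only occasionally; individual
// adds go straight to a worker.
void addDocument(String document) throws IOException {
  long now = System.currentTimeMillis();
  if (now - lastRefresh > REFRESH_INTERVAL_MS) {
    writableLocations = master.getUpdateableIndex(indexIds);  // occasional master round-trip
    lastRefresh = now;
  }
  // Pick a writable index at random and talk to its worker directly;
  // the master is not involved in the add itself.
  IndexLocation target = writableLocations[random.nextInt(writableLocations.length)];
  workerFor(target).addDocument(target.getIndexId(), document);
}
}}}
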
- ==== 7. How do commits work? ====
- 
- ''It seems like the master might want to be involved in commits too, or maybe we just rely
on the slave-to-master heartbeat to kick off immediately after a commit so that index replication
can be initiated? I like the latter approach. New versions are only published as frequently
as clients poll the master for updated IndexLocations. Clients keep a cache of both readable
and updatable index locations that are periodically refreshed.''
- 
- Current approach: When an index is committed, replication waits until the next heartbeat to begin. It might be possible to bring this forward, but realistically I couldn't see any advantage in doing so.
- 
- There are probably some subtleties about heartbeating mechanisms on large clusters. For example, if heartbeats are staggered that is good (no collisions); if they are synchronized that is bad (collisions). Does the system settle to a steady state? If so, changing heartbeat intervals could cause problems. I'm just guessing here, though. The main point is that this is a complex system, so sometimes doing things that seem obvious can have unexpected (and possibly detrimental) effects. It's important to do the math to check that the likely result is what you expect.
- 
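As a rough worked example of that kind of arithmetic (the numbers are invented purely for illustration): with 100 workers, a 10 second heartbeat interval, 50 indexes per worker and roughly 100 bytes of status per index, each heartbeat carries about 5 KB, the master sees on average 10 heartbeats per second, and steady-state heartbeat traffic is around 50 KB/s. That is modest, but if the heartbeats drift into step the master instead receives a burst of 100 × 5 KB = 500 KB every 10 seconds, which is where collision effects would start to show.
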
- ==== 8. Finding updateable indexes ====
- 
- ''Looking at''
- {{{
-        IndexLocation[] getUpdateableIndex(String[] id);
- }}}
- ''I'd assumed that the updateable version of an index does not move around very often. Perhaps
a lease mechanism is required. For example, a call to getUpdateableIndex might be valid for
ten minutes.''
- 
- It's hard to give guarantees here as nodes can fail at any time ...
- 
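One way to picture the lease idea from the quote (purely hypothetical; no such class exists in the current code):

{{{
// Hypothetical lease returned by getUpdateableIndex: the location is only
// promised until the expiry time, and even that is best effort, since the
// worker can fail at any moment. After expiry the client asks the master again.
public class IndexLease {
  public IndexLocation location;
  public long expiryTimeMillis;     // e.g. granted time + 10 minutes

  public boolean isValid() {
    return System.currentTimeMillis() < expiryTimeMillis;
  }
}
}}}
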
- ==== 9. What is an Index ID? ====
- 
- ''But what is index id exactly?  Looking at the example API you laid down, it must be a
single physical index (as opposed to a logical index).  In which case, is it entirely up to
the client to manage multi-shard indices?  For example, if we had a "photo" index broken
up into 3 shards, each shard would have a separate index id and it would be up to the client
to know this, and to query across the different "photo0", "photo1", "photo2" indices.  The
master would
- have no clue those indices were related.  Hmmm, that doesn't work very well for deletes
though.
- 
- It seems like there should be the concept of a logical index, that is composed of multiple
shards, and each shard has multiple copies.
- 
- Or were you thinking that a cluster would only contain a single logical index, and hence
all different index ids are simply different shards of that single logical index?  That would
seem to be consistent with ClientToMasterProtocol.getSearchableIndexes() lacking an id argument.''
- 
- This comes back to how we implement sharding discussed above ...
- 
- ==== 10. What about SOLR? ====
- 
- ''It depends on the project scope and how extensible things are. It seems like the master
would be a WAR, capable of running stand-alone. What about index servers (slaves)?  Would
this project include just the interfaces to be implemented by Solr/Nutch nodes, some common
implementation code behind the interfaces in the form of a library, or also complete standalone
WARs?
- 
- I'd need to be able to extend the ClientToSlave protocol to add additional methods for Solr
(for passing in extra parameters and returning various extra data such as facets, highlighting,
etc).''
- 
- ==== 11. How does versioning work? ====
- 
- ''Could this be done in Lucene? It would also need a way to open a specific index version
(rather than just the latest), but I guess that could also be hacked into Directory by hiding
later "segments" files (assumes lockless is committed).''
- 
- Current approach: At the moment the worker simply copies the whole index to create a new version, which is going to be expensive in disk space. However, when it replicates indexes it only copies recent changes if a local copy of the index already exists, so network traffic should be efficient.
- 
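The "only copies recent changes" behaviour relies on the fact that Lucene index files are essentially write-once: a replica only needs to fetch the files it does not already have. A sketch of that idea (the RemoteIndex type and the transfer helper are placeholders, not the real replication code):

{{{
// Sketch: incremental copy of an index version from another worker.
// Lucene rarely rewrites a file in place, so a file that already exists
// locally with the same name and length can be assumed identical and skipped.
void replicate(RemoteIndex remote, File localDir) throws IOException {
  for (String name : remote.listFileNames()) {
    File local = new File(localDir, name);
    if (local.exists() && local.length() == remote.length(name)) {
      continue;                          // already present from an older version
    }
    copyFromRemote(remote, name, local); // placeholder for the actual transfer
  }
}
}}}
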
- == Related Pages ==
- 
- http://wiki.apache.org/solr/DistributedSearch - Distributed search in SOLR
- 
- http://wiki.apache.org/solr/CollectionDistribution - Collection distribution in SOLR
- 
