lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "NewSolrCloudDesign" by NoblePaul
Date Thu, 18 Aug 2011 07:38:37 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "NewSolrCloudDesign" page has been changed by NoblePaul:
http://wiki.apache.org/solr/NewSolrCloudDesign?action=diff&rev1=3&rev2=4

   1. The source node is instructed to force refresh the node -> partition map from ZooKeeper
   1. Goto step #1
  
+ === Node Restart ===
+ 
+ A node restart can mean one of the following things:
+  * The JVM crashed and was manually or automatically restarted
+  * The node was in a temporary network partition and either could not reach ZooKeeper (and was supposed to be dead) or could not receive updates from the leader for a period of time. A node restart in this scenario signifies the removal of the network partition.
+  * A hardware failure or maintenance window caused the removal of the node from the cluster, and the node has been started again to rejoin the cluster.
+ 
+ The node reads the list of partitions for which it is a member and, for each partition, starts a recovery process from that partition’s leader. Then the node follows the steps in the New Node section without checking the auto_add_new_nodes property. This ensures that the cluster recovers from the imbalance created by the node failure.
+ 
+ === Lifecycle of a Write Operation ===
+ 
+ Writes are performed by clients using the standard Solr update formats. A write operation can be sent to any node in the cluster. The node uses the hash_function and the Range-Partition mapping to identify the partition to which the doc belongs. A ZooKeeper lookup is performed to identify the leader of the shard, and the operation is forwarded there. A SolrJ enhancement may enable it to send the write directly to the leader.
+ 
+ The leader assigns the operation a Partition Version and writes the operation to its transaction
log and forwards the document + version + hash to other nodes belonging to the shard. The
nodes write the document + hash to the index and record the operation in the transaction log.
The leader responds with an ‘OK’ if at least min_writes number of nodes respond with ‘OK’.
The min_writes in the cluster properties can be overridden by specifying it in the write request.
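+ The hash-based routing and the min_writes quorum described above can be sketched as follows. This is an illustrative Python sketch, not Solr code: hash_function, min_writes, and the Range-Partition mapping are names from this design page, while every helper below is hypothetical.

```python
# Illustrative sketch of write routing; helper names are hypothetical.

def hash_function(doc_id, max_hash_value=1000):
    """Stable hash of the doc's unique key into [0, max_hash_value)."""
    return sum(ord(c) for c in doc_id) % max_hash_value

def find_partition(doc_hash, range_partition_map):
    """range_partition_map: list of ((lo, hi), partition_id), inclusive ranges."""
    for (lo, hi), partition in range_partition_map:
        if lo <= doc_hash <= hi:
            return partition
    raise KeyError("no partition covers hash %d" % doc_hash)

def leader_accepts_write(acks, min_writes):
    """The leader responds 'OK' only if at least min_writes replicas acked."""
    return acks >= min_writes

ranges = [((0, 499), "X"), ((500, 999), "Z")]
partition = find_partition(hash_function("doc-42"), ranges)
```

+ Any node can run this lookup, which is why a write may be sent anywhere in the cluster and still reach the correct shard leader.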
+ 
+ The cloud mode would not offer any explicit commit/rollback operations. Commits are managed by auto-commits at intervals (commit_within) by the leader, which triggers a commit on all members of the shard. The latest version available to a node is recorded with the commit point.
+ 
+ == Transaction Log ==
+ 
+  * A transaction log records all operations performed on an Index between two commits
+  * Each commit starts a new transaction log because a commit guarantees durability of operations
performed before it
+  * The sync policy can be tunable; e.g. flush instead of fsync as the default can protect against JVM crashes but not against power failure, and can be much faster
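+ The append-and-rollover behaviour of the transaction log above can be sketched as follows. The class and method names are hypothetical, not Solr APIs; this only illustrates that each commit closes out one log and starts a fresh one.

```python
# Minimal sketch of a per-commit transaction log; names are illustrative.

class TransactionLog:
    def __init__(self):
        self.logs = [[]]          # logs[-1] is the log since the last commit

    def record(self, operation):
        """Every operation between two commits is appended to the current log."""
        self.logs[-1].append(operation)

    def commit(self):
        """A commit guarantees durability of operations recorded before it
        (flush vs. fsync is the tunable trade-off) and starts a new log."""
        committed = self.logs[-1]
        self.logs.append([])
        return committed

tlog = TransactionLog()
tlog.record({"op": "add", "id": "doc-1"})
tlog.record({"op": "delete", "id": "doc-9"})
batch = tlog.commit()
```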
+ 
+ == Recovery ==
+ 
+ A recovery can be triggered during:
+  * Bootstrap
+  * Partition splits
+  * Cluster re-balancing
+ 
+ 
+ The node starts by setting its status as ‘recovering’. During this phase, the node will not receive any read requests, but it will receive all new write requests, which shall be written to a separate transaction log. The node looks up the version of the index it has and queries the ‘leader’ for the latest version of the partition. The leader responds with the set of operations to be performed before the node can be in sync with the rest of the nodes in the shard.
+ 
+ This may involve copying the index first and then replaying the transaction log, depending on how far behind the node is. If an index copy is required, the index files are replicated first to the local index and then the transaction logs are replayed. The replay of the transaction log is nothing but a stream of regular write requests. During this time, the node may have accumulated new writes, which should then be played back on the index. The moment the node catches up with the latest commit point, it marks itself as “ready”. At this point, read requests can be handled by the node.
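+ The recovery flow above can be sketched as a small state machine. This is a hedged illustration under the page's description, not Solr's implementation; all class and field names are hypothetical.

```python
# Sketch of recovery: buffer new writes while 'recovering', replay the
# leader's catch-up operations, then play back the buffered writes.

class RecoveringNode:
    def __init__(self, index_version):
        self.status = "recovering"
        self.version = index_version
        self.index = []           # stand-in for the local Lucene index
        self.buffered = []        # separate transaction log for new writes

    def receive_write(self, op):
        if self.status == "recovering":
            self.buffered.append(op)      # recorded but not applied yet
        else:
            self.index.append(op)

    def recover(self, leader_ops):
        """leader_ops: operations the leader says are needed to catch up."""
        for op in leader_ops:
            self.index.append(op)
            self.version = op["version"]
        # play back writes accumulated during recovery
        while self.buffered:
            self.index.append(self.buffered.pop(0))
        self.status = "ready"             # now eligible for read requests

node = RecoveringNode(index_version=5)
node.receive_write({"id": "doc-3", "version": 8})   # arrives mid-recovery
node.recover([{"id": "doc-1", "version": 6}, {"id": "doc-2", "version": 7}])
```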
+ 
+ === Handling Node Failures ===
+ 
+ There may be temporary network partitions between some nodes or between a node and ZooKeeper.
The cluster should wait for some time before re-balancing data.
+ 
+ ==== Leader failure ====
+ 
+ If a node fails and it is the leader of any of the shards, the other members will initiate a leader election process. Writes to this partition are not accepted until the new leader is elected. The cluster then follows the steps described in the non-leader failure section.
+ 
+  
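+ The page does not specify the election mechanism. One common ZooKeeper recipe (an assumption here, not necessarily what Solr would implement) is for each shard member to create an ephemeral sequential znode and elect the member holding the lowest sequence number, simulated locally below.

```python
# Simulated lowest-sequence-number election; a real implementation would
# read ephemeral sequential znodes from ZooKeeper. Names are illustrative.

def elect_leader(members):
    """members: dict of node_name -> ZooKeeper sequence number.
    Returns the surviving node with the smallest sequence number."""
    return min(members, key=members.get)

survivors = {"node2": 12, "node3": 7, "node4": 15}   # old leader's znode is gone
leader = elect_leader(survivors)
```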
+ ==== Non-Leader failure ====
+ 
+ The leader would wait for the min_reaction_time before identifying a new node to be a part
of the shard.
+ The leader acquires the Cluster Lock and uses the node-shard assignment algorithm to identify
a node as
+ the new member of the shard. The node -> partition mapping is updated in ZooKeeper and
the cluster lock is
+ released. The new node is then instructed to force reload the node -> partition mapping
from ZooKeeper.
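+ The reaction above can be sketched as follows. The helpers are hypothetical; the Cluster Lock and ZooKeeper interactions are marked as comments because only their ordering matters here.

```python
# Sketch of the non-leader-failure reaction: wait out min_reaction_time,
# then assign a replacement under the Cluster Lock and update the map.
import time

def replace_failed_member(shard, candidates, node_partition_map,
                          min_reaction_time=0.0, clock=time.monotonic):
    # Wait so transient network partitions do not trigger data movement.
    deadline = clock() + min_reaction_time
    while clock() < deadline:
        time.sleep(0.01)
    # --- the Cluster Lock would be acquired here ---
    new_member = candidates[0]   # stand-in for the node-shard assignment algorithm
    node_partition_map.setdefault(new_member, []).append(shard)
    # --- lock released; the new node force-reloads the map from ZooKeeper ---
    return new_member

mapping = {}
chosen = replace_failed_member("shard_X", ["node3", "node4"], mapping)
```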
+ 
+ == Splitting partitions ==
+ 
+ A partition can be split either by an explicit cluster admin command or automatically by splitting strategies provided by Solr. An explicit split command may specify the target partition(s) for the split.
+ 
+ Assume the partition ‘X’ with hash range 100 - 199 is identified to be split into X
(100 - 149) and a new partition Y (150 - 199). The leader of X records the split action in
ZooKeeper with the new desired range values of X as well as Y. No nodes are notified of this
split action or the existence of the new partition.
+ 
+  1. The leader of X acquires the Cluster Lock, identifies nodes which can be assigned to partition Y (algorithm TBD), informs them of the new partition, and updates the partition -> node mapping. The leader of X waits for the nodes to respond, and once it determines that the new partition is ready to accept commands, it proceeds as follows:
+  1. The leader of X suspends all commits until the split is complete.
+  1. The leader of X opens an IndexReader on the latest commit point (say version V) and instructs its peers to do the same.
+  1. The leader of X starts streaming the transaction log after version V for the hash range 150 - 199 to the leader of Y.
+  1. The leader of Y records the requests sent in #4 in its transaction log only, i.e. they are not played on the index.
+  1. The leader of X initiates an index split on the IndexReader opened in step #3.
+  1. The index created in #6 is sent to the leader of Y and is installed.
+  1. The leader of Y instructs its peers to start the recovery process. At the same time, it starts playing its transaction log on the index.
+   a. Once all peers of partition Y have reached at least version V:
+   a. The leader of Y asks the leader of X to prepare a FilteredIndexReader on top of the reader created in step #3 which will have documents belonging to hash range 100 - 149 only.
+   a. Once the leader of X acknowledges the completion of the request in #8b, the leader of Y acquires the Cluster Lock and modifies the range -> partition mapping to start receiving regular search/write requests from the whole cluster.
+   a. The leader of Y asks the leader of X to start using the FilteredIndexReader created in #8b for search requests.
+   a. The leader of Y asks the leader of X to force refresh the range -> partition mapping from ZooKeeper. At this point, it is guaranteed that the transaction log streaming which started in #4 will be stopped.
+  1. The leader of X deletes all documents with hash values not belonging to its partition, commits, and re-opens the searcher on the latest commit point.
+  1. At this point, the split is considered complete, and the leader of X resumes commits according to the commit_within parameter.
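+ The range arithmetic behind the split example above (X covering 100 - 199 split into X 100 - 149 and Y 150 - 199) can be sketched as follows; the helpers are illustrative, not Solr code.

```python
# Split an inclusive hash range into two halves and route a doc's hash
# to whichever half owns it after the split.

def split_range(lo, hi):
    """Split an inclusive hash range into two contiguous halves."""
    mid = (lo + hi) // 2
    return (lo, mid), (mid + 1, hi)

def route_after_split(doc_hash, x_range, y_range):
    lo, hi = x_range
    if lo <= doc_hash <= hi:
        return "X"
    lo, hi = y_range
    if lo <= doc_hash <= hi:
        return "Y"
    raise KeyError("hash outside the original partition")

x_range, y_range = split_range(100, 199)   # (100, 149) and (150, 199)
```

+ The FilteredIndexReader on X and the final delete-by-hash step both amount to applying route_after_split to every stored document.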
+ 
+ 
+ Notes:
+  * The partition being split does not honor the commit_within parameter until the split operation completes
+  * Any distributed search operation performed from the time of #8c until the end of #8d may return inconsistent results, i.e. the number of search results may be wrong.
+ 
+ 
+ == Cluster Re-balancing ==
+ 
+ The cluster can be rebalanced by an explicit cluster admin command.
+ 
+ TBD
+ 
+ == Monitoring ==
+ TBD
+ 
+ == Configuration ==
+ 
+ === solr_cluster.properties ===
+ 
+ These are the set of properties which are outside of the regular Solr configuration and are applicable across all nodes in the cluster:
+  * replication_factor : The number of replicas of a doc maintained by the cluster
+  * min_writes : The minimum number of successful writes before the write operation is signaled as successful. This may be overridden on a per-write basis
+  * commit_within : The maximum time within which a write operation becomes visible in a search
+  * hash_function : The implementation which computes the hash of a given doc
+  * max_hash_value : The maximum value that the hash_function can output. Theoretically, this is also the maximum number of partitions the cluster can ever have
+  * min_reaction_time : The time (in secs) before any reallocation/splitting is done after a node comes up or goes down
+  * min_replica_for_reaction : If the number of replica nodes goes below this threshold, the splitting is triggered even if the min_reaction_time has not elapsed
+  * auto_add_new_nodes : A Boolean flag. If true, new nodes are automatically used as read replicas for existing partitions; otherwise, new nodes sit idle until the cluster needs them.
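+ A hedged example of what such a file might look like. Every value below is illustrative, not a shipped default, and the hash_function value is a stand-in name rather than a real class:

```
# solr_cluster.properties -- illustrative values only
replication_factor=3
min_writes=2
commit_within=10000
hash_function=murmur3
max_hash_value=1000
min_reaction_time=60
min_replica_for_reaction=1
auto_add_new_nodes=true
```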
+ 
