lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Trivial Update of "NewSolrCloudDesign" by NoblePaul
Date Thu, 18 Aug 2011 07:23:40 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "NewSolrCloudDesign" page has been changed by NoblePaul:
http://wiki.apache.org/solr/NewSolrCloudDesign?action=diff&rev1=2&rev2=3

  
  == Glossary ==
  
-  * Cluster : Cluster is a set of Solr nodes managed as a single unit. The entire cluster must have a single schema and solrconfig
+  * '''Cluster''' : Cluster is a set of Solr nodes managed as a single unit. The entire cluster must have a single schema and solrconfig
-  * Node : A JVM instance running Solr
+  * '''Node''' : A JVM instance running Solr
-  * Partition : A partition is a subset of the entire document collection.  A partition is created in such a way that all its documents can be contained in a single index.
+  * '''Partition''' : A partition is a subset of the entire document collection.  A partition is created in such a way that all its documents can be contained in a single index.
-  * Shard : A Partition needs to be stored in multiple nodes as specified by the replication factor. All these nodes collectively form a shard. A node may be a part of multiple shards
+  * '''Shard''' : A Partition needs to be stored in multiple nodes as specified by the replication factor. All these nodes collectively form a shard. A node may be a part of multiple shards
-  * Leader : Each Shard has one node identified as its leader. All the writes for documents belonging to a partition should be routed through the leader.
+  * '''Leader''' : Each Shard has one node identified as its leader. All the writes for documents belonging to a partition should be routed through the leader.
-  * Replication Factor : Minimum number of copies of a document maintained by the cluster
+  * '''Replication Factor''' : Minimum number of copies of a document maintained by the cluster
-  * Transaction Log : An append-only log of write operations maintained by each node
+  * '''Transaction Log''' : An append-only log of write operations maintained by each node
-  * Partition version : This is a counter maintained with the leader of each shard and incremented on each write operation and sent to the peers
+  * '''Partition version''' : This is a counter maintained with the leader of each shard and incremented on each write operation and sent to the peers
-  * Cluster Lock : This is a global lock which must be acquired in order to change the range -> partition or the partition -> node mappings.
+  * '''Cluster Lock''' : This is a global lock which must be acquired in order to change the range -> partition or the partition -> node mappings.
  
  == Guiding Principles ==
  
@@ -31, +31 @@

   * Automatic failover of writes
   * Automatically honour the replication factor in the event of a node failure
  
+ 
+ == Zookeeper ==
+ 
+ A ZooKeeper cluster is used as:
+  * The central configuration store for the cluster
+  * A co-ordinator for operations requiring distributed synchronization
+  * The system-of-record for cluster topology
+ 
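+ As a rough illustration of these three roles, the sketch below uses the plain ZooKeeper client directly; the znode paths (/solrcloud/config, /solrcloud/nodes) are illustrative assumptions and not part of this design.
+ {{{
+ // Minimal sketch (assumed znode layout, not an actual Solr API): a node
+ // connecting to ZooKeeper, reading shared configuration and listing nodes.
+ import java.util.List;
+ 
+ import org.apache.zookeeper.WatchedEvent;
+ import org.apache.zookeeper.Watcher;
+ import org.apache.zookeeper.ZooKeeper;
+ 
+ public class ClusterStateReader {
+   public static void main(String[] args) throws Exception {
+     // Connect to the ZooKeeper ensemble with a 30 second session timeout.
+     ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, new Watcher() {
+       public void process(WatchedEvent event) {
+         // Watches fire here when configuration or topology changes.
+         System.out.println("zk event: " + event);
+       }
+     });
+ 
+     // Central configuration store: read the shared config blob (watch = true).
+     byte[] config = zk.getData("/solrcloud/config", true, null);
+ 
+     // System-of-record for topology: list the currently registered nodes.
+     List<String> nodes = zk.getChildren("/solrcloud/nodes", true);
+ 
+     System.out.println(nodes.size() + " nodes, config is " + config.length + " bytes");
+     zk.close();
+   }
+ }
+ }}}
+ 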
+ == Partitioning ==
+ 
+ The cluster is configured with a fixed max_hash_value, ‘N’, which is set to a fairly large value (say 1000). Each document’s hash is calculated as:
+ {{{
+ 
+ hash = hash_function(doc.getKey()) % N
+ }}}
+ 
+ Ranges of hash values are assigned to partitions and stored in ZooKeeper. For example, we may have a range -> partition mapping as follows:
+ {{{
+ range  : partition
+ ------  ----------
+ 0 - 99 : 1
+ 100-199: 2
+ 200-299: 3
+ }}}
+ 
+ The hash is added as an indexed field in the doc and is immutable. It may also be used during an index split.
+ 
+ The hash function is pluggable. It can accept a document and return a consistent & positive integer hash value. The system provides a default hash function which uses the content of a configured, required & immutable field (default is the unique_key field) to calculate hash values.
+ 
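+ The following is a minimal sketch of the scheme above; the hash function shown (String.hashCode masked to be positive) and the hard-coded range table are illustrative assumptions, since the real hash function is pluggable and the mapping lives in ZooKeeper.
+ {{{
+ // Minimal sketch of hashing and range lookup. N is set to 300 here so the
+ // three example ranges (0-99 -> 1, 100-199 -> 2, 200-299 -> 3) cover the
+ // whole hash space; the hash function is an assumed stand-in.
+ public class Partitioner {
+   static final int N = 300;                        // max_hash_value
+   static final int[] RANGE_STARTS = {0, 100, 200}; // lower bound of each range
+   static final int[] PARTITIONS   = {1, 2, 3};     // partition for each range
+ 
+   // Consistent, positive hash derived from the unique key field.
+   static int hash(String uniqueKey) {
+     return (uniqueKey.hashCode() & 0x7fffffff) % N;
+   }
+ 
+   // Walk the range table (normally read from ZooKeeper) to find the partition.
+   static int partitionFor(int hash) {
+     int partition = PARTITIONS[0];
+     for (int i = 0; i < RANGE_STARTS.length; i++) {
+       if (hash >= RANGE_STARTS[i]) partition = PARTITIONS[i];
+     }
+     return partition;
+   }
+ 
+   public static void main(String[] args) {
+     int h = hash("doc-42");
+     System.out.println("hash=" + h + " -> partition " + partitionFor(h));
+   }
+ }
+ }}}
+ 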
+ == Shard Assignment ==
+ 
+ The node -> partition mapping can only be changed by a node which has acquired the Cluster Lock in ZooKeeper. So when a node comes up, it first attempts to acquire the cluster lock, waits for it to be acquired, and then identifies the partition to which it can subscribe.
+ 
+ === Node to shard assignment ===
+ 
+ The node which is trying to find a new node for a shard should acquire the cluster lock first. First of all, the leader is identified for the shard. Out of all the available nodes, the node with the least number of shards is selected. If there is a tie, the node which is the leader of the least number of shards is chosen. If there is still a tie, a random node is chosen.
+ 
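+ A minimal sketch of this selection rule follows; the Node type and its counters are hypothetical stand-ins for state that would be read from ZooKeeper while holding the Cluster Lock.
+ {{{
+ // Minimal sketch: pick the node with the fewest shards, break ties by the
+ // fewest leaderships, and break remaining ties randomly (via the shuffle).
+ import java.util.ArrayList;
+ import java.util.Collections;
+ import java.util.List;
+ 
+ public class ShardAssigner {
+   static class Node {
+     final String name;
+     final int shardCount;   // shards this node already hosts
+     final int leaderCount;  // shards this node currently leads
+     Node(String name, int shardCount, int leaderCount) {
+       this.name = name; this.shardCount = shardCount; this.leaderCount = leaderCount;
+     }
+   }
+ 
+   static Node pickNodeForShard(List<Node> candidates) {
+     List<Node> shuffled = new ArrayList<Node>(candidates);
+     Collections.shuffle(shuffled);  // remaining ties break randomly
+     Node best = null;
+     for (Node n : shuffled) {
+       if (best == null
+           || n.shardCount < best.shardCount
+           || (n.shardCount == best.shardCount && n.leaderCount < best.leaderCount)) {
+         best = n;
+       }
+     }
+     return best;
+   }
+ 
+   public static void main(String[] args) {
+     List<Node> nodes = new ArrayList<Node>();
+     nodes.add(new Node("nodeA", 3, 1));
+     nodes.add(new Node("nodeB", 2, 2));
+     nodes.add(new Node("nodeC", 2, 0));
+     System.out.println("chosen: " + pickNodeForShard(nodes).name); // nodeC
+   }
+ }
+ }}}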
+ 
+ == Boot Strapping ==
+ 
+ === Cluster Startup ===
+ 
+ A node is started pointing to a ZooKeeper host and port. The first node in the cluster may be started with cluster configuration properties and the schema/config files for the cluster. The first node would upload the configuration into ZooKeeper and bootstrap the cluster. The cluster is deemed to be in the “bootstrap” state. In this state, the node -> partition mapping is not computed and the cluster does not accept any read/write requests except for clusteradmin commands.
+ 
+ After the initial set of nodes in the cluster have started up, a clusteradmin command (TBD description) is issued by the administrator. This command accepts an integer “partitions” parameter and performs the following steps (a sketch follows the list):
+  1. Acquire the Cluster Lock
+  1. Allocate the “partitions” number of partitions
+  1. Acquire nodes for each partition
+  1. Update the node -> partition mapping in ZooKeeper
+  1. Release the Cluster Lock
+  1. Inform all nodes to force-update their own node -> partition mapping from ZooKeeper
+ 
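+ The sequence above can be sketched as follows; the ClusterCoordinator interface and its method names are hypothetical, they only mirror the steps and are not an existing Solr API.
+ {{{
+ // Minimal sketch of the clusteradmin bootstrap sequence. All coordinator
+ // methods are assumed helpers, not a real implementation.
+ import java.util.List;
+ 
+ interface ClusterCoordinator {
+   void acquireClusterLock();
+   void releaseClusterLock();
+   List<Integer> allocatePartitions(int numPartitions);  // step 2
+   void assignNodesToPartition(int partition);           // step 3
+   void publishNodePartitionMapping();                   // step 4: write to ZooKeeper
+   void notifyNodesToRefreshMapping();                   // step 6
+ }
+ 
+ public class BootstrapCommand {
+   static void bootstrap(ClusterCoordinator c, int numPartitions) {
+     c.acquireClusterLock();                              // step 1
+     try {
+       List<Integer> partitions = c.allocatePartitions(numPartitions);
+       for (int p : partitions) {
+         c.assignNodesToPartition(p);
+       }
+       c.publishNodePartitionMapping();
+     } finally {
+       c.releaseClusterLock();                            // step 5
+     }
+     c.notifyNodesToRefreshMapping();                     // step 6: after lock release
+   }
+ }
+ }}}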
+ 
+ === Node Startup ===
+ 
+ Upon startup, the node checks ZooKeeper to see whether it is a part of any existing shard(s). If ZooKeeper has no record of the node, or if the node is not part of any shards, it follows the steps in the New Node section; otherwise it follows the steps in the Node Restart section.
+ 
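+ A minimal sketch of this check, assuming a hypothetical /solrcloud/node_partitions/<nodeName> znode layout:
+ {{{
+ // Minimal sketch: decide between the New Node and Node Restart paths by
+ // asking ZooKeeper whether this node is already assigned to any shard.
+ // The znode layout used here is an illustrative assumption.
+ import java.util.List;
+ 
+ import org.apache.zookeeper.ZooKeeper;
+ 
+ public class NodeStartupCheck {
+   static boolean isExistingMember(ZooKeeper zk, String nodeName) throws Exception {
+     String path = "/solrcloud/node_partitions/" + nodeName;
+     if (zk.exists(path, false) == null) {
+       return false;             // ZooKeeper has no record of this node -> New Node
+     }
+     List<String> shards = zk.getChildren(path, false);
+     return !shards.isEmpty();   // member of at least one shard -> Node Restart
+   }
+ }
+ }}}
+ 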
+ ==== New Node ====
+ 
+ A new node is one which has never been part of the cluster and is newly added to increase the capacity of the cluster.
+ 
+ If the “auto_add_new_nodes” cluster property is false, the new nodes register themselves in ZooKeeper as “idle” and wait until another node asks them to participate. Otherwise, they proceed as follows (a sketch of the source-selection step appears after this list):
+  1. The Cluster Lock is acquired
+  1. A suitable source (node, partition) tuple is chosen:
+   1. The list of available partitions is scanned to find partitions which have fewer than “replication_factor” nodes. In case of a tie, the partition with the least number of nodes is selected. In case of another tie, a random partition is chosen.
+   1. If all partitions have enough replicas, the nodes are scanned to find the one which has the most partitions. In case of a tie, of all the partitions on such nodes, the one which has the most documents is chosen. In case of a further tie, a random partition on a random node is chosen.
+   1. If moving the chosen (node, partition) tuple to the current node will decrease the maximum partitions-per-node ratio of the cluster, the chosen tuple is returned. Otherwise, no (node, partition) tuple is chosen and the algorithm terminates.
+   1. The node -> partition mapping is updated in ZooKeeper
+  1. The node status in ZooKeeper is updated to “recovery” state
+  1. The Cluster Lock is released
+  1. A “recovery” is initiated against the leader of the chosen partition
+  1. After the recovery is complete, the Cluster Lock is acquired again
+  1. The source (node, partition) is removed from the node -> partition map in ZooKeeper
+  1. The Cluster Lock is released
+  1. The source node is instructed to force refresh the node -> partition map from ZooKeeper
+  1. Goto step #1
+ 
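+ The first part of the source-selection step (finding an under-replicated partition) can be sketched as follows; the PartitionInfo fields are hypothetical stand-ins for state read from ZooKeeper while the Cluster Lock is held.
+ {{{
+ // Minimal sketch: prefer partitions with fewer than replication_factor
+ // replicas; among those, take the one with the fewest replicas; remaining
+ // ties are broken randomly via the shuffle. A null result means the
+ // rebalance-from-the-busiest-node rule applies instead.
+ import java.util.ArrayList;
+ import java.util.Collections;
+ import java.util.List;
+ 
+ public class SourceSelector {
+   static class PartitionInfo {
+     final int partition;
+     final int nodeCount;  // replicas currently hosting this partition
+     final int docCount;   // documents in this partition
+     PartitionInfo(int partition, int nodeCount, int docCount) {
+       this.partition = partition; this.nodeCount = nodeCount; this.docCount = docCount;
+     }
+   }
+ 
+   static PartitionInfo pickUnderReplicated(List<PartitionInfo> partitions,
+                                            int replicationFactor) {
+     List<PartitionInfo> shuffled = new ArrayList<PartitionInfo>(partitions);
+     Collections.shuffle(shuffled);
+     PartitionInfo best = null;
+     for (PartitionInfo p : shuffled) {
+       if (p.nodeCount >= replicationFactor) continue;  // already has enough replicas
+       if (best == null || p.nodeCount < best.nodeCount) best = p;
+     }
+     return best;
+   }
+ }
+ }}}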
