cassandra-commits mailing list archives

From Apache Wiki <>
Subject [Cassandra Wiki] Update of "ArchitectureInternals" by PeterSchuller
Date Tue, 12 Apr 2011 21:54:54 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "ArchitectureInternals" page has been changed by PeterSchuller.
The comment on this change is: Augment the read path description a bit to account better for
read repair and snitches.


   * See [[ArchitectureSSTable]] and ArchitectureCommitLog for more details
  = Read path =
-  * !StorageProxy gets the nodes responsible for replicas of the keys from the !ReplicationStrategy,
then sends read messages to them
+  * !StorageProxy gets the endpoints (nodes) responsible for replicas of the keys from the
!ReplicationStrategy
    * This may be a !SliceFromReadCommand, a !SliceByNamesReadCommand, or a !RangeSliceReadCommand.
+  * !StorageProxy filters the endpoints to contain only those that are currently up/alive
+  * !StorageProxy then sorts the responsible nodes by "proximity", by asking the endpoint snitch.
+    * The definition of "proximity" is up to the endpoint snitch.
+      * With a !SimpleSnitch, proximity directly corresponds to proximity on the token ring.
+      * With the !NetworkTopologySnitch, endpoints that are in the same rack are always considered
"closer" than those that are not. Failing that, endpoints in the same data center are always
considered "closer" than those that are not.
+      * The !DynamicSnitch, typically enabled in the configuration, wraps whatever underlying
snitch is configured (such as !SimpleSnitch or !NetworkTopologySnitch) and dynamically adjusts the perceived
"closeness" of endpoints based on their recent performance, in an effort to
avoid routing traffic to endpoints that are slow to respond.
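The snitch layering above can be sketched as follows. This is a simplified, hypothetical interface (not Cassandra's actual IEndpointSnitch API): a simple snitch preserves ring order, while a wrapping dynamic snitch re-sorts endpoints by a recent-latency score.

```java
import java.util.*;

// Hypothetical, simplified snitch interface for illustration only.
public class SnitchSketch {
    interface Snitch {
        List<String> sortByProximity(String local, List<String> endpoints);
    }

    // Simple snitch: proximity is just the order derived from the token ring.
    static class SimpleSnitch implements Snitch {
        public List<String> sortByProximity(String local, List<String> endpoints) {
            return new ArrayList<>(endpoints);
        }
    }

    // Dynamic snitch: wraps another snitch and re-sorts by observed latency.
    static class DynamicSnitch implements Snitch {
        final Snitch wrapped;
        final Map<String, Double> latencyMillis = new HashMap<>();
        DynamicSnitch(Snitch wrapped) { this.wrapped = wrapped; }
        void recordLatency(String endpoint, double ms) { latencyMillis.put(endpoint, ms); }
        public List<String> sortByProximity(String local, List<String> endpoints) {
            List<String> sorted = wrapped.sortByProximity(local, endpoints);
            // Lower recent latency means "closer"; unknown endpoints sort first.
            sorted.sort(Comparator.comparingDouble(e -> latencyMillis.getOrDefault(e, 0.0)));
            return sorted;
        }
    }

    public static void main(String[] args) {
        DynamicSnitch snitch = new DynamicSnitch(new SimpleSnitch());
        snitch.recordLatency("10.0.0.1", 40.0);  // slow replica
        snitch.recordLatency("10.0.0.2", 2.0);   // fast replica
        List<String> sorted = snitch.sortByProximity("10.0.0.9",
                Arrays.asList("10.0.0.1", "10.0.0.2"));
        // The fast replica is now considered "closest".
        assert sorted.get(0).equals("10.0.0.2");
        System.out.println(sorted);
    }
}
```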
+  * !StorageProxy then arranges for messages to be sent to nodes as required:
+    * The closest node (as determined by the proximity sorting described above) will be sent
a command to perform an actual data read (i.e., return data to the coordinating node).
+    * As required by the consistency level, additional nodes may be sent digest commands, asking
them to perform the read locally but send back only a digest of the result. For example, at replication
factor 3 a read at consistency level QUORUM would require one digest read in addition to
the data read sent to the closest node. (See !ReadCallback, instantiated by !StorageProxy.)
+    * If read repair is enabled (probabilistically, if the read repair chance is between
0% and 100%), the remaining nodes responsible for the row will be sent messages to compute the
digest of the response. (Again, see !ReadCallback, instantiated by !StorageProxy.)
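The data-read/digest-read split above can be illustrated with a small sketch. This is simplified and hypothetical (not Cassandra's actual ReadCallback logic): one full data read goes to the closest replica, the remaining replicas needed for the consistency level are asked only for a digest, and matching digests mean the replicas agree.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

// Simplified illustration; names and structure are hypothetical.
public class DigestReadSketch {
    // blockFor: number of replicas that must answer for this consistency
    // level, e.g. quorum of RF=3 is 2.
    static int digestReadsNeeded(int blockFor) {
        return blockFor - 1;  // one of the responses is the full data read
    }

    static byte[] digest(String rowData) throws Exception {
        // A digest read returns only a hash of the row, not the row itself.
        return MessageDigest.getInstance("MD5")
                .digest(rowData.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        int rf = 3, quorum = rf / 2 + 1;       // quorum of 3 is 2
        assert digestReadsNeeded(quorum) == 1; // matches the RF=3 QUORUM example

        // Matching digests mean the queried replicas agree on the row.
        byte[] fromDataNode = digest("row-contents");
        byte[] fromDigestNode = digest("row-contents");
        assert Arrays.equals(fromDataNode, fromDigestNode);
        System.out.println("digest reads at QUORUM, RF=3: " + digestReadsNeeded(quorum));
    }
}
```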
   * On the data node, !ReadVerbHandler gets the data from CFS.getColumnFamily or CFS.getRangeSlice
and sends it back as a !ReadResponse
     * The row is located by doing a binary search on the index in SSTableReader.getPosition
     * For single-row requests, we use a !QueryFilter subclass to pick the data from the Memtable
and SSTables that we are looking for.  The Memtable read is straightforward.  The SSTable
read is a little different depending on which kind of request it is:
@@ -30, +40 @@

       * If we are reading a group of columns by name, we still use the column index to locate
each column, but first we check the row-level bloom filter to see if we need to do anything
at all
     * The column readers provide an Iterator interface, so the filter can easily stop when
it's done, without reading more columns than necessary
       * Since we need to potentially merge columns from multiple SSTable versions, the reader
iterators are combined through a !ReducingIterator, which takes an iterator of uncombined
columns as input, and yields combined versions as output
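The reducing step above can be sketched as follows. The names here are hypothetical (this is not Cassandra's ReducingIterator): column versions from several SSTables are combined, and for each column name the version with the highest timestamp wins.

```java
import java.util.*;

// Hypothetical sketch of reducing uncombined column versions from
// multiple SSTables into one combined version per column name.
public class ReduceSketch {
    record Column(String name, String value, long timestamp) {}

    static List<Column> reduce(List<List<Column>> perSSTable) {
        // Keep the newest version per column name, yielding results in
        // column-name order as the merged readers would.
        Map<String, Column> newest = new TreeMap<>();
        for (List<Column> sstable : perSSTable)
            for (Column c : sstable)
                newest.merge(c.name(), c,
                        (a, b) -> a.timestamp() >= b.timestamp() ? a : b);
        return new ArrayList<>(newest.values());
    }

    public static void main(String[] args) {
        List<Column> older = List.of(new Column("col1", "stale", 1),
                                     new Column("col2", "x", 1));
        List<Column> newer = List.of(new Column("col1", "fresh", 2));
        List<Column> merged = reduce(List.of(older, newer));
        assert merged.size() == 2;
        assert merged.get(0).value().equals("fresh"); // newest col1 version wins
    }
}
```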
-  * If a quorum read was requested, !StorageProxy waits for a majority of nodes to reply
and makes sure the answers match before returning.  Otherwise, it returns the data reply as
soon as it gets it, and checks the other replies for discrepancies in the background in !StorageService.doConsistencyCheck.
 This is called "read repair," and also helps achieve consistency sooner.
-    * As an optimization, !StorageProxy only asks the closest replica for the actual data;
the other replicas are asked only to compute a hash of the data.
+ In addition:
+  * At any point, if a message is destined for the local node, the appropriate piece of work
(data read or digest read) is submitted directly to the appropriate local stage (see !StageManager)
rather than going through messaging over the network.
+  * Sending the data read only to the closest replica is intended as an
optimization to avoid sending excessive amounts of data over the network. A digest read still incurs
the full cost of a read internally on the node (CPU and, in particular, disk), but
avoids taxing the network.
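The local short-circuit above can be sketched as follows. This is a simplified, hypothetical dispatcher (not Cassandra's actual messaging code): work for the local node is submitted straight to a local stage, modeled here as an executor, while remote work would go through network messaging (stubbed out below).

```java
import java.util.concurrent.*;

// Hypothetical sketch of dispatching a read either to a local stage or
// over the network; the remote path is stubbed for illustration.
public class LocalStageSketch {
    static final ExecutorService READ_STAGE = Executors.newFixedThreadPool(2);

    static Future<String> dispatchRead(String endpoint, String localEndpoint) {
        if (endpoint.equals(localEndpoint)) {
            // Local node: submit the read directly to the local read stage,
            // skipping serialization and network messaging entirely.
            return READ_STAGE.submit(() -> "data-from-local-read");
        }
        // Remote node: in the real system this would be a network message;
        // stubbed here as an already-completed reply.
        return CompletableFuture.completedFuture("data-from-" + endpoint);
    }

    public static void main(String[] args) throws Exception {
        Future<String> local = dispatchRead("10.0.0.1", "10.0.0.1");
        assert local.get().equals("data-from-local-read");
        Future<String> remote = dispatchRead("10.0.0.2", "10.0.0.1");
        assert remote.get().equals("data-from-10.0.0.2");
        READ_STAGE.shutdown();
    }
}
```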
  = Deletes =
   * See DistributedDeletes
