cassandra-commits mailing list archives

From Apache Wiki <>
Subject [Cassandra Wiki] Update of "WritePathForUsers" by MichaelEdge
Date Mon, 30 Nov 2015 07:06:49 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "WritePathForUsers" page has been changed by MichaelEdge:

  == The Local Coordinator ==
  The local coordinator receives the write request from the client and performs the following:
-   1. The local coordinator determines which nodes are responsible for storing the data:
+   1. Firstly, the local coordinator determines which nodes are responsible for storing the data:
-     * The first replica is chosen based on the Partitioner hashing the primary key
-     * Other replicas are chosen based on replication strategy defined for the keyspace
+     * The first replica is chosen by hashing the partition key (the first part of the primary key) using the Partitioner;
the Murmur3Partitioner is the default.
+     * Other replicas are chosen based on the replication strategy defined for the keyspace.
In a production cluster this is most likely the NetworkTopologyStrategy.
    1. The write request is then sent to all replica nodes simultaneously.
    1. The total number of nodes receiving the write request is determined by the replication
factor for the keyspace.
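
The replica selection described above can be sketched as a walk around a token ring. This is a minimal illustration, not Cassandra's implementation: MD5 stands in for the Murmur3 hash, the ring layout and node names are hypothetical, and the clockwise walk resembles a simple single-datacenter strategy rather than the NetworkTopologyStrategy.

```python
import bisect
import hashlib


def token_for(partition_key: str) -> int:
    """Map a key to a 64-bit token. MD5 is an assumption standing in for Murmur3."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def replicas_for(partition_key, ring, replication_factor):
    """Pick replica nodes for a key.

    ring is a sorted list of (token, node_name) pairs (hypothetical layout).
    The first replica is the node with the smallest token >= the key's token,
    wrapping around; further replicas are the next distinct nodes clockwise.
    """
    tokens = [token for token, _ in ring]
    distinct_nodes = {node for _, node in ring}
    wanted = min(replication_factor, len(distinct_nodes))
    position = bisect.bisect_left(tokens, token_for(partition_key)) % len(ring)
    replicas = []
    while len(replicas) < wanted:
        node = ring[position % len(ring)][1]
        if node not in replicas:
            replicas.append(node)
        position += 1
    return replicas
```

For example, with four nodes spaced evenly on the ring and a replication factor of 3, `replicas_for("user:42", ring, 3)` returns three consecutive nodes starting at the owner of the key's token.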
+ == Replica Nodes ==
+ Replica nodes receive the write request from the local coordinator and perform the following:
+ 1. Write data to the Commit Log. This is a sequential, memory-mapped log file, on disk,
that can be used to rebuild MemTables if a crash occurs before the MemTable is flushed to disk.
+ 1. Write data to the MemTable. MemTables are mutable, in-memory tables that are read/write.
Each physical table on each replica node has an associated MemTable.
+ 1. If the write request is a DELETE operation (whether a delete of a column or a row), a
tombstone marker is written to the Commit Log and MemTable to indicate the delete.
+ 1. If row caching is used, invalidate the cache for that row. Row cache is populated on
read only, so it must be invalidated when data for that row is written.
+ 1. Acknowledge the write request back to the local coordinator.
+ The local coordinator waits for the appropriate number of acknowledgements (dependent on
the consistency level for this write request) before acknowledging back to the client.
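
The replica steps and the coordinator's acknowledgement count can be sketched as follows. This is a toy model under stated assumptions: the class and function names are hypothetical, only three consistency levels are handled, and the replicas are written one after another here, whereas the text describes the request being sent to all replicas simultaneously.

```python
from dataclasses import dataclass, field

TOMBSTONE = object()  # hypothetical marker standing in for a delete


@dataclass
class ReplicaNode:
    commit_log: list = field(default_factory=list)
    memtable: dict = field(default_factory=dict)
    row_cache: dict = field(default_factory=dict)

    def write(self, key, value):
        """Apply one write; passing TOMBSTONE as the value models a DELETE."""
        self.commit_log.append((key, value))  # append-only log entry
        self.memtable[key] = value            # in-memory update (or tombstone)
        self.row_cache.pop(key, None)         # invalidate any cached row
        return True                           # acknowledgement


def required_acks(consistency_level, replication_factor):
    """Acknowledgements needed for a subset of consistency levels."""
    if consistency_level == "ONE":
        return 1
    if consistency_level == "QUORUM":
        return replication_factor // 2 + 1
    if consistency_level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {consistency_level}")


def coordinate_write(replicas, key, value, consistency_level):
    """Send the write to every replica and report whether enough acknowledged."""
    acks = sum(1 for replica in replicas if replica.write(key, value))
    return acks >= required_acks(consistency_level, len(replicas))
```

With three replicas, a QUORUM write succeeds once two of them acknowledge.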
+ == Flushing MemTables ==
+ MemTables are flushed to disk based on various factors, some of which include:
+ * commitlog_total_space_in_mb is exceeded
+ * memtable_total_space_in_mb is exceeded
+ * The ‘nodetool flush’ command is executed
+ * Etc.
+ Each flush of a MemTable results in one new, immutable SSTable on disk. After the flush
an SSTable (Sorted String Table) is read-only. As with the write to the Commit Log, the write
to the SSTable data file is a sequential write operation. An SSTable consists of multiple
files, including the following:
+ * Bloom Filter
+ * Index
+ * Compression File (optional)
+ * Statistics File
+ * Data File
+ * Summary
+ * TOC.txt
+ Each MemTable flush executes the following steps:
+ 1. Sort the MemTable columns by row key
+ 1. Write the Bloom Filter
+ 1. Write the Index
+ 1. Serialise and write the data to the SSTable Data File
+ 1. Write Compression File (if compression is used)
+ 1. Write Statistics File
+ 1. Purge the written data from the Commit Log
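
The flush steps above can be illustrated with a small in-memory sketch. It is not the real SSTable format: the `key=value` line encoding, the 64-bit Bloom filter built from SHA-256, and the dictionary standing in for the on-disk files are all assumptions made for illustration, and the index is built while the data is written rather than as a separate step.

```python
import hashlib

FILTER_BITS = 64  # toy filter size, an assumption


def _bit_positions(key: str):
    """Two bit positions per key, derived from a SHA-256 digest."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return (digest[0] % FILTER_BITS, digest[1] % FILTER_BITS)


def might_contain(bloom: int, key: str) -> bool:
    """False means the key is definitely absent; True means it may be present."""
    return all(bloom >> bit & 1 for bit in _bit_positions(key))


def flush_memtable(memtable: dict, commit_log: list) -> dict:
    """Turn a memtable of str -> str into a toy immutable 'SSTable'."""
    rows = sorted(memtable.items())          # sort by row key
    bloom = 0
    index = {}
    data = bytearray()
    for key, value in rows:
        for bit in _bit_positions(key):      # record the key in the filter
            bloom |= 1 << bit
        index[key] = len(data)               # byte offset of this row
        data += f"{key}={value}\n".encode("utf-8")  # sequential append
    memtable.clear()                         # flushed data leaves memory
    commit_log.clear()                       # and is purged from the log
    return {"filter": bloom, "index": index, "data": bytes(data)}
```

A read can consult `might_contain` first and skip the data entirely when it returns False, then use the index offset to seek directly to the row.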
+ == Unavailable Replica Nodes and Hinted Handoff ==
+ When a local coordinator is unable to send data to a replica node due to the replica node
being unavailable, the local coordinator stores the data in its local system.hints table;
this process is known as Hinted Handoff. The data is stored for a default period of 3 hours.
When the replica node comes back online, the coordinator node will send the data to the replica node.
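
Hinted Handoff as described here can be sketched with a coordinator that keeps a local list of hints. The class names and the injectable clock are hypothetical, the list stands in for the system.hints table, and the sketch assumes that "stored for a default period of 3 hours" means hints older than that are discarded on replay.

```python
import time
from dataclasses import dataclass, field

HINT_WINDOW_SECONDS = 3 * 60 * 60  # the 3 hour default mentioned in the text


@dataclass
class Replica:
    up: bool = True
    data: dict = field(default_factory=dict)


class Coordinator:
    def __init__(self, clock=time.time):
        self.hints = []      # stands in for the local system.hints table
        self.clock = clock   # injectable so the window can be tested

    def send(self, replica, key, value):
        """Write to the replica, or store a hint if it is unavailable."""
        if replica.up:
            replica.data[key] = value
            return True
        self.hints.append((self.clock(), replica, key, value))
        return False

    def replay_hints(self, replica):
        """Deliver stored hints to a replica that is back online."""
        now = self.clock()
        remaining = []
        for stored_at, target, key, value in self.hints:
            if target is replica and replica.up:
                if now - stored_at <= HINT_WINDOW_SECONDS:
                    target.data[key] = value
                # hints older than the window are dropped, not delivered
            else:
                remaining.append((stored_at, target, key, value))
        self.hints = remaining
```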
+ == Write Path Advantages ==
+ * The write path is one of Cassandra’s key strengths: for each write request one sequential
disk write plus one in-memory write occur, both of which are extremely fast.
+ * During a write operation, Cassandra never reads before writing, never rewrites data, never
deletes data and never performs random I/O.
