cassandra-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Cassandra Wiki] Update of "ArchitectureInternals" by JonathanEllis
Date Tue, 25 Sep 2012 15:00:35 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "ArchitectureInternals" page has been changed by JonathanEllis:
http://wiki.apache.org/cassandra/ArchitectureInternals?action=diff&rev1=26&rev2=27

Comment:
update general and write sections

  = General =
   * Configuration file is parsed by !DatabaseDescriptor (which also has all the default values,
if any)
-  * Thrift generates an API interface in Cassandra.java; the implementation is !CassandraServer,
and !CassandraDaemon ties it together.
+  * Thrift generates an API interface in Cassandra.java; the implementation is !CassandraServer,
and !CassandraDaemon ties it together (mostly: handling commitlog replay, and setting up the
Thrift plumbing)
-  * !CassandraServer turns thrift requests into the internal equivalents, then !StorageProxy
does the actual work, then !CassandraServer turns it back into thrift again
+  * !CassandraServer turns thrift requests into the internal equivalents, then !StorageProxy
does the actual work, then !CassandraServer turns the results back into thrift again
-  * !StorageService is kind of the internal counterpart to !CassandraDaemon.  It handles
turning raw gossip into the right internal state.
-  * !AbstractReplicationStrategy controls what nodes get secondary, tertiary, etc. replicas
of each key range.  Primary replica is always determined by the token ring (in !TokenMetadata)
but you can do a lot of variation with the others.  !RackUnaware just puts replicas on the
next N-1 nodes in the ring.  !RackAware puts the first non-primary replica in the next node
in the ring in ANOTHER data center than the primary; then the remaining replicas in the same
as the primary.
+    * CQL requests are compiled and executed through QueryProcessor.  Note that as of 1.2
we still support both the old cql2 dialect and the newer cql3 dialect, in different packages.
+  * !StorageService is kind of the internal counterpart to !CassandraDaemon.  It handles
turning raw gossip into the right internal state and dealing with ring changes, i.e., transferring
data to new replicas.  !TokenMetadata tracks which nodes own what arcs of the ring.  Starting
in 1.2, each node may have multiple Tokens.
+  * !AbstractReplicationStrategy controls what nodes get secondary, tertiary, etc. replicas
of each key range.  Primary replica is always determined by the token ring (in !TokenMetadata)
but you can do a lot of variation with the others.  !SimpleStrategy just puts replicas on
the next N-1 nodes in the ring.  !NetworkTopologyStrategy allows the user to define how many
replicas to place in each datacenter, and then takes rack locality into account for each DC
-- we want to avoid multiple replicas on the same rack, if possible.
   * !MessagingService handles connection pooling and running internal commands on the appropriate
stage (basically, a threaded executorservice).  Stages are set up in !StageManager; currently
there are read, write, and stream stages.  (Streaming is for when one node copies large sections
of its SSTables to another, for bootstrap or relocation on the ring.)  The internal commands
are defined in !StorageService; look for `registerVerbHandlers`.
-  * Configuration for the node (administrative stuff, such as which directories to store
data in, as well as global configuration, such as which global partitioner to use) is held
by !DatabaseDescriptor. Per-KS, per-CF, and per-Column metadata are all stored as migrations
across the database and can be updated by calls to system_update/add_* thrift calls, or can
be changed locally and temporarily at runtime. See ConfigurationNotes.
+  * Configuration for the node (administrative stuff, such as which directories to store
data in, as well as global configuration, such as which global partitioner to use) is held
by !DatabaseDescriptor. Per-KS, per-CF, and per-Column metadata are all stored as parts of
the Schema: !KSMetadata, !CFMetadata, !ColumnDefinition. See also ConfigurationNotes.
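The replica-placement rule above (primary at the token's position on the ring, then the next N-1 distinct nodes clockwise) can be sketched as follows.  This is a minimal illustration under assumed names, not Cassandra's actual !AbstractReplicationStrategy code; a `TreeMap` stands in for !TokenMetadata's sorted token ring.

```java
// Hypothetical sketch of SimpleStrategy-style placement: walk the ring
// clockwise from the key's token and take the first N distinct nodes.
import java.util.*;

public class RingPlacementSketch {
    // sortedTokens maps each token to the node that owns it; the TreeMap
    // models the sorted token ring (a stand-in for TokenMetadata).
    static List<String> replicasFor(long keyToken,
                                    NavigableMap<Long, String> sortedTokens,
                                    int replicationFactor) {
        List<String> replicas = new ArrayList<>();
        // Primary replica: first node at or after the key's token, wrapping around.
        Map.Entry<Long, String> start = sortedTokens.ceilingEntry(keyToken);
        if (start == null) start = sortedTokens.firstEntry();
        Iterator<String> ring = ringIterator(sortedTokens, start.getKey());
        while (ring.hasNext() && replicas.size() < replicationFactor) {
            String node = ring.next();
            if (!replicas.contains(node)) replicas.add(node); // skip duplicate owners
        }
        return replicas;
    }

    // Iterate the ring clockwise starting at startToken, wrapping once.
    static Iterator<String> ringIterator(NavigableMap<Long, String> tokens, long startToken) {
        List<String> ordered = new ArrayList<>();
        ordered.addAll(tokens.tailMap(startToken, true).values());
        ordered.addAll(tokens.headMap(startToken, false).values());
        return ordered.iterator();
    }

    public static void main(String[] args) {
        NavigableMap<Long, String> ring = new TreeMap<>();
        ring.put(0L, "A"); ring.put(100L, "B"); ring.put(200L, "C"); ring.put(300L, "D");
        System.out.println(replicasFor(150L, ring, 3)); // prints [C, D, A]
    }
}
```

A !NetworkTopologyStrategy-style placement would add rack and datacenter filters to the same clockwise walk, which is why the primary is always fixed by the ring while the rest can vary.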
  
  = Write path =
   * !StorageProxy gets the nodes responsible for replicas of the keys from the !ReplicationStrategy,
then sends !RowMutation messages to them.
     * If nodes are changing position on the ring, "pending ranges" are associated with their
destinations in !TokenMetadata and these are also written to.
-    * If nodes that should accept the write are down, but the remaining nodes can fulfill
the requested !ConsistencyLevel, the writes for the down nodes will be sent to another node
instead, with a header (a "hint") saying that data associated with that key should be sent
to the replica node when it comes back up.  This is called HintedHandoff and reduces the "eventual"
in "eventual consistency."  Note that HintedHandoff is only an '''optimization'''; ArchitectureAntiEntropy
is responsible for restoring consistency more completely.
+    * ConsistencyLevel determines how many replies to wait for.  See !WriteResponseHandler.determineBlockFor.
 Interaction with pending ranges is a bit tricky; see https://issues.apache.org/jira/browse/CASSANDRA-833
+    * If the FailureDetector says that we don't have enough nodes alive to satisfy the ConsistencyLevel,
we fail the request with !UnavailableException
+    * If the FD gives us the okay but writes time out anyway, because of a failure after
the request is sent or because of an overload scenario, !StorageProxy will write a "hint"
locally, to be replayed once the timed-out replica(s) recover.  This is called HintedHandoff.
 Note that HH does not prevent inconsistency entirely; either unclean shutdown or hardware
failure can prevent the coordinating node from writing or replaying the hint. ArchitectureAntiEntropy
is responsible for restoring consistency more completely.
-  * on the destination node, !RowMutationVerbHandler uses Table.Apply to hand the write first
to !CommitLog.java, then to the Memtable for the appropriate !ColumnFamily.
+  * on the destination node, !RowMutationVerbHandler uses Table.Apply to hand the write first
to the !CommitLog, then to the Memtable for the appropriate !ColumnFamily.
-  * When a Memtable is full, it gets sorted and written out as an SSTable asynchronously
by !ColumnFamilyStore.switchMemtable
+  * When a Memtable is full, it gets sorted and written out as an SSTable asynchronously
by !ColumnFamilyStore.maybeSwitchMemtable (so named because multiple concurrent calls to it
will only flush once)
+    * "Fullness" is monitored by !MeteredFlusher; the goal is to flush quickly enough that
we don't OOM as new writes arrive while we still have to hang on to the memory of the old
memtable during flush
     * When enough SSTables exist, they are merged by !CompactionManager.doCompaction
-      * Making this concurrency-safe without blocking writes or reads while we remove the
old SSTables from the list and add the new one is tricky, because naive approaches require
waiting for all readers of the old sstables to finish before deleting them (since we can't
know if they have actually started opening the file yet; if they have not and we delete the
file first, they will error out).  The approach we have settled on is to not actually delete
old SSTables synchronously; instead we register a phantom reference with the garbage collector,
so when no references to the SSTable exist it will be deleted.  (We also write a compaction
marker to the file system so if the server is restarted before that happens, we clean out
the old SSTables at startup time.)
-      * A "major" compaction of merging _all_ sstables may be manually initiated by the user;
this results in submitMajor calling doCompaction with all the sstables in the !ColumnFamily,
rather than just sstables of similar size.
+      * Making this concurrency-safe without blocking writes or reads while we remove the
old SSTables from the list and add the new one is tricky.  We perform manual reference counting
on sstables during reads so that we know when they are safe to remove, e.g., !ColumnFamilyStore.getSSTablesForKey.
+      * Multiple !CompactionStrategies exist.  The original, !SizeTieredCompactionStrategy,
combines sstables that are similar in size.  This can result in a lot of wasted space in overwrite-intensive
workloads.  !LeveledCompactionStrategy provides stricter guarantees at the price of more compaction
i/o; see http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra and http://www.datastax.com/dev/blog/when-to-use-leveled-compaction
   * See [[ArchitectureSSTable]] and ArchitectureCommitLog for more details
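The coordinator-side decisions above (compute how many acks the !ConsistencyLevel requires, fail fast with !UnavailableException if too few replicas are alive, and queue hints for dead replicas) can be sketched as below.  All names here are illustrative, not Cassandra's actual API; in particular, real HintedHandoff writes hints on timeout as described above, while this sketch queues them up front for simplicity.

```java
// Hypothetical sketch of the coordinator's write decision: determine the
// required ack count ("blockFor"), check liveness against it, and record
// hints for replicas that cannot receive the write right now.
import java.util.*;

public class WritePathSketch {
    enum ConsistencyLevel { ONE, QUORUM, ALL }

    // How many replica acks this level requires (cf. WriteResponseHandler.determineBlockFor).
    static int blockFor(ConsistencyLevel cl, int replicationFactor) {
        switch (cl) {
            case ONE:    return 1;
            case QUORUM: return replicationFactor / 2 + 1;
            case ALL:    return replicationFactor;
            default:     throw new AssertionError();
        }
    }

    // Returns the replicas the mutation is sent to; hints are queued for dead ones.
    static List<String> coordinateWrite(List<String> replicas,
                                        Set<String> live,
                                        ConsistencyLevel cl,
                                        List<String> hintQueue) {
        int required = blockFor(cl, replicas.size());
        List<String> targets = new ArrayList<>();
        for (String r : replicas) {
            if (live.contains(r)) targets.add(r);
            else hintQueue.add(r); // replay later, once the replica recovers
        }
        if (targets.size() < required)  // stands in for UnavailableException
            throw new IllegalStateException("Unavailable: need " + required
                                            + " live replicas, have " + targets.size());
        return targets;
    }

    public static void main(String[] args) {
        List<String> hints = new ArrayList<>();
        List<String> targets = coordinateWrite(
                Arrays.asList("A", "B", "C"),
                new HashSet<>(Arrays.asList("A", "C")),
                ConsistencyLevel.QUORUM, hints);
        System.out.println(targets + " hints=" + hints); // prints [A, C] hints=[B]
    }
}
```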
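The manual reference counting mentioned for compaction safety works roughly like this minimal sketch (not Cassandra's code): readers acquire a reference before touching an SSTable and release it when done, and the file becomes deletable only once it is both obsolete and unreferenced.

```java
// Illustrative sketch of SSTable lifetime via manual reference counting.
import java.util.concurrent.atomic.AtomicInteger;

public class SSTableRefSketch {
    private final AtomicInteger refs = new AtomicInteger(1); // 1 = the store's own reference
    private volatile boolean deleted = false;

    // A reader takes a reference before opening the file; returns false if
    // the table has already been released for deletion.
    boolean acquire() {
        while (true) {
            int n = refs.get();
            if (n <= 0) return false;            // already fully released
            if (refs.compareAndSet(n, n + 1)) return true;
        }
    }

    // Readers release when done; compaction releases the store's own
    // reference once the table is obsolete. The last release deletes.
    void release() {
        if (refs.decrementAndGet() == 0) deleted = true; // safe to unlink the file now
    }

    boolean isDeleted() { return deleted; }

    public static void main(String[] args) {
        SSTableRefSketch t = new SSTableRefSketch();
        t.acquire();       // a reader starts using the table
        t.release();       // compaction drops the store's reference; reader still holds one
        System.out.println(t.isDeleted()); // prints false: reader still active
        t.release();       // reader finishes
        System.out.println(t.isDeleted()); // prints true: last reference gone, file deletable
    }
}
```

The compare-and-set loop in `acquire` is what makes the "did the reader start before deletion?" race safe without blocking: a reader can never resurrect a table whose count already reached zero.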
  
  = Read path =
