cassandra-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Cassandra Wiki] Update of "ArchitectureInternals_ZH" by HubertChang
Date Tue, 13 Apr 2010 08:18:25 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "ArchitectureInternals_ZH" page has been changed by HubertChang.
http://wiki.apache.org/cassandra/ArchitectureInternals_ZH?action=diff&rev1=16&rev2=17

--------------------------------------------------

  ## page was copied from ArchitectureInternals
- = General =
-  * Configuration file is parsed by !DatabaseDescriptor (which also has all the default values,
if any)
-  * Thrift generates an API interface in Cassandra.java; the implementation is !CassandraServer,
and !CassandraDaemon ties it together.
-  * !CassandraServer turns thrift requests into the internal equivalents, then !StorageProxy
does the actual work, then !CassandraServer turns it back into thrift again
-  * !StorageService is kind of the internal counterpart to !CassandraDaemon.  It handles
turning raw gossip into the right internal state.
-  * !AbstractReplicationStrategy controls what nodes get secondary, tertiary, etc. replicas
of each key range.  Primary replica is always determined by the token ring (in !TokenMetadata)
but you can do a lot of variation with the others.  !RackUnaware just puts replicas on the
next N-1 nodes in the ring.  !RackAware puts the first non-primary replica in the next node
in the ring in ANOTHER data center than the primary; then the remaining replicas in the same
as the primary.
-  * !MessagingService handles connection pooling and running internal commands on the appropriate
stage (basically, a threaded executorservice).  Stages are set up in !StageManager; currently
there are read, write, and stream stages.  (Streaming is for when one node copies large sections
of its SSTables to another, for bootstrap or relocation on the ring.)  The internal commands
are defined in !StorageService; look for `registerVerbHandlers`.
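
The stage arrangement described in the bullet above can be pictured with a small sketch. It is only a toy model of the idea (per-verb handlers dispatched onto per-stage thread pools); SimpleStageManager, VerbHandler, register and deliver are illustrative names, not Cassandra's actual classes or signatures.

{{{#!java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Toy model of "stages": each verb runs on the thread pool owned by its stage,
// roughly the SEDA-style arrangement the bullet above describes.
public class SimpleStageManager {
    enum Stage { READ, WRITE, STREAM }

    interface VerbHandler { void doVerb(byte[] message); }

    private final Map<Stage, ExecutorService> stages = new EnumMap<>(Stage.class);
    private final Map<String, VerbHandler> handlers = new HashMap<>();
    private final Map<String, Stage> verbToStage = new HashMap<>();

    public SimpleStageManager() {
        for (Stage s : Stage.values())
            stages.put(s, Executors.newFixedThreadPool(4)); // pool size is arbitrary here
    }

    // analogous in spirit to the registerVerbHandlers mentioned above:
    // bind a verb to a handler and to the stage it should run on
    public void register(String verb, Stage stage, VerbHandler handler) {
        handlers.put(verb, handler);
        verbToStage.put(verb, stage);
    }

    // an incoming internal message is queued onto the pool belonging to its verb's stage
    public void deliver(String verb, byte[] message) {
        stages.get(verbToStage.get(verb)).submit(() -> handlers.get(verb).doVerb(message));
    }
}
}}}

The point is simply that a verb's handler never runs on the network thread; it is queued onto the executor owned by its stage.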
+ ##master-page:ArchitectureInternals
+ #format wiki
+ #language zh
+ == Overview ==
+  * Access to Cassandra goes through a service framework such as Thrift or Avro, which generates the client- and server-side interface code. For Cassandra this produces Cassandra.java, containing Cassandra.Iface; the client-side implementation of Cassandra.Iface is Cassandra.Client, and the server-side implementation is CassandraServer. (Added by the translator)
+  * Cassandra starts up inside CassandraDaemon; during startup the configuration file is parsed by DatabaseDescriptor (which also supplies default values where needed).
+  * Once started, !CassandraDaemon provides the network endpoint for clients; the client requests it accepts are handled and executed by !CassandraServer.
+  * CassandraServer turns each client request into its internal equivalent; StorageProxy does the actual work for that request, and CassandraServer returns the result to the client.
+  * StorageService handles the requests and responses exchanged between nodes inside the Cassandra cluster; it can be seen as the cluster-internal counterpart of CassandraDaemon. It turns raw gossip into the corresponding internal state. (Translator's note: StorageProxy's methods ultimately call into StorageService to carry out cluster-internal access.)
+  * AbstractReplicationStrategy controls which nodes get the secondary, tertiary, etc. replicas of each key range. The node holding the primary replica is determined by the data's token (together with other factors). With the rack-unaware strategy, the replicas are stored on the next N-1 nodes along the token ring. With the rack-aware strategy, the first non-primary replica is placed in a different data center from the primary, and the remaining N-2 replicas (assuming N > 2) stay in the same one as the primary, on the next N-2 nodes along the ring. (A minimal placement sketch follows this list.)
+  * MessagingService handles the internal connection pools and runs internal commands on the appropriate stage (via a multi-threaded ExecutorService). Stages are set up in StageManager; currently there are a read stage, a write stage, and a stream stage (streaming is what happens when large amounts of SSTable data have to be moved, e.g. during bootstrap or when tokens are reassigned). (Translator's note: "stage" here is neither a phase nor a theatre stage; it is a term of art, see the Cassandra source code or the SEDA literature for its precise meaning.) The internal commands that run on the stages are defined in StorageService; see "registerVerbHandlers".
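
As referenced in the replication bullet above, here is a minimal sketch of rack-unaware placement: the primary sits at the first token at or past the key's token, and the remaining N-1 replicas sit on the following ring positions. The TreeMap-of-tokens model and the RingPlacement name are illustrative simplifications, not Cassandra's TokenMetadata API.

{{{#!java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal rack-unaware placement: primary = first node at/after the key's token,
// remaining replicas = the next N-1 nodes walking clockwise around the ring.
public class RingPlacement {
    private final TreeMap<BigInteger, String> ring = new TreeMap<>(); // token -> node address

    public void addNode(BigInteger token, String node) { ring.put(token, node); }

    public List<String> replicasFor(BigInteger keyToken, int replicationFactor) {
        List<String> replicas = new ArrayList<>();
        // start at the first token >= keyToken, wrapping to the beginning of the ring if necessary
        SortedMap<BigInteger, String> tail = ring.tailMap(keyToken);
        List<String> ordered = new ArrayList<>(tail.values());
        ordered.addAll(ring.headMap(keyToken).values());
        for (String node : ordered) {
            if (replicas.size() == replicationFactor) break;
            replicas.add(node); // the first is the primary, the rest are the N-1 extra replicas
        }
        return replicas;
    }
}
}}}

A rack-aware variant would, in addition, inspect each candidate node's data center before accepting it as the first non-primary replica.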
  
- = Write path =
-  * !StorageProxy gets the nodes responsible for replicas of the keys from the !ReplicationStrategy,
then sends !RowMutation messages to them.
-    * If nodes are changing position on the ring, "pending ranges" are associated with their
destinations in !TokenMetadata and these are also written to.
-    * If nodes that should accept the write are down, but the remaining nodes can fulfill
the requested !ConsistencyLevel, the writes for the down nodes will be sent to another node
instead, with a header (a "hint") saying that data associated with that key should be sent
to the replica node when it comes back up.  This is called HintedHandoff and reduces the "eventual"
in "eventual consistency."  Note that HintedHandoff is only an '''optimization'''; ArchitectureAntiEntropy
is responsible for restoring consistency more completely.
-  * on the destination node, !RowMutationVerbHandler uses Table.Apply to hand the write first
to !CommitLog.java, then to the Memtable for the appropriate !ColumnFamily.
-  * When a Memtable is full, it gets sorted and written out as an SSTable asynchronously
by !ColumnFamilyStore.switchMemtable
-    * When enough SSTables exist, they are merged by !ColumnFamilyStore.doFileCompaction
-      * Making this concurrency-safe without blocking writes or reads while we remove the
old SSTables from the list and add the new one is tricky, because naive approaches require
waiting for all readers of the old sstables to finish before deleting them (since we can't
know if they have actually started opening the file yet; if they have not and we delete the
file first, they will error out).  The approach we have settled on is to not actually delete
old SSTables synchronously; instead we register a phantom reference with the garbage collector,
so when no references to the SSTable exist it will be deleted.  (We also write a compaction
marker to the file system so if the server is restarted before that happens, we clean out
the old SSTables at startup time.)
-  * See [[ArchitectureSSTable]] and ArchitectureCommitLog for more details
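
The first write-path bullets above (pick the replicas, send the mutation, redirect writes for down nodes with a hint) reduce to roughly the following decision. Cluster, isAlive and the idea of piggy-backing the hint on the send call are illustrative simplifications, not Cassandra's internal API.

{{{#!java
import java.util.List;

// Pseudo-flow of a write, following the bullets above: live replicas get the
// mutation directly; a write aimed at a dead replica is redirected to a live
// node together with a hint naming the replica that should eventually get it.
public class WriteSketch {
    interface Cluster {
        List<String> replicasFor(String key);       // from the replication strategy
        boolean isAlive(String node);               // from the failure detector
        void send(String node, String key, byte[] value, String hintedTarget); // null = no hint
    }

    static void write(Cluster cluster, String key, byte[] value) {
        List<String> replicas = cluster.replicasFor(key);
        // assume at least one live replica; otherwise the ConsistencyLevel cannot be met anyway
        String firstLive = replicas.stream().filter(cluster::isAlive).findFirst().orElseThrow();
        for (String replica : replicas) {
            if (cluster.isAlive(replica))
                cluster.send(replica, key, value, null);       // normal write
            else
                cluster.send(firstLive, key, value, replica);  // hinted handoff to a live node
        }
    }
}
}}}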
- 
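The compaction bullet's approach of letting the garbage collector decide when an obsolete SSTable may be deleted can be sketched with java.lang.ref directly; the file handling and class names below are simplified for illustration.

{{{#!java
import java.io.File;
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Delete a compacted-away data file only once the GC proves that nobody can
// still be holding a reference to its in-memory reader, as described above.
public class DeferredDeleter {
    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    // keep the phantom references reachable, and remember which file each one guards
    private final Map<Reference<?>, File> pendingDeletes = new ConcurrentHashMap<>();

    public void deleteWhenUnreferenced(Object sstableReader, File dataFile) {
        pendingDeletes.put(new PhantomReference<>(sstableReader, queue), dataFile);
    }

    // call periodically (or from a dedicated thread blocking on queue.remove())
    public void processQueue() {
        Reference<?> ref;
        while ((ref = queue.poll()) != null) {
            File f = pendingDeletes.remove(ref);
            if (f != null && f.delete())
                System.out.println("deleted obsolete sstable " + f);
        }
    }
}
}}}

A compaction marker on disk, as the bullet notes, covers the case where the process restarts before the queue is ever processed.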
- = Read path =
-  * !StorageProxy gets the nodes responsible for replicas of the keys from the !ReplicationStrategy,
then sends read messages to them
-    * This may be a !SliceFromReadCommand, a !SliceByNamesReadCommand, or a !RangeSliceReadCommand,
depending
-  * On the data node, !ReadVerbHandler gets the data from CFS.getColumnFamily or CFS.getRangeSlice
and sends it back as a !ReadResponse
-    * The row is located by doing a binary search on the index in SSTableReader.getPosition
-    * For single-row requests, we use a !QueryFilter subclass to pick the data from the Memtable
and SSTables that we are looking for.  The Memtable read is straightforward.  The SSTable
read is a little different depending on which kind of request it is:
-      * If we are reading a slice of columns, we use the row-level column index to find where
to start reading, and deserialize block-at-a-time (where "block" is the group of columns covered
by a single index entry) so we can handle the "reversed" case without reading vast amounts
into memory
-      * If we are reading a group of columns by name, we still use the column index to locate
each column, but first we check the row-level bloom filter to see if we need to do anything
at all
-    * The column readers provide an Iterator interface, so the filter can easily stop when
it's done, without reading more columns than necessary
-      * Since we need to potentially merge columns from multiple SSTable versions, the reader
iterators are combined through a !ReducingIterator, which takes an iterator of uncombined
columns as input, and yields combined versions as output
-  * If a quorum read was requested, !StorageProxy waits for a majority of nodes to reply
and makes sure the answers match before returning.  Otherwise, it returns the data reply as
soon as it gets it, and checks the other replies for discrepancies in the background in !StorageService.doConsistencyCheck.
 This is called "read repair," and also helps achieve consistency sooner.
-    * As an optimization, !StorageProxy only asks the closest replica for the actual data;
the other replicas are asked only to compute a hash of the data.
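
The ReducingIterator step mentioned above, collapsing several versions of the same column read from the Memtable and multiple SSTables, amounts to keeping the newest timestamp per column name. The Column record and reduce method below are stand-ins for illustration, assuming the merged input is already sorted by column name.

{{{#!java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Collapse a name-sorted stream of column versions into one column per name,
// keeping the version with the highest timestamp -- the same reduction a
// ReducingIterator performs in the read path above.
public class ColumnReducer {
    record Column(String name, byte[] value, long timestamp) {}

    static List<Column> reduce(Iterator<Column> sortedByName) {
        List<Column> out = new ArrayList<>();
        Column current = null;
        while (sortedByName.hasNext()) {
            Column next = sortedByName.next();
            if (current == null || !current.name().equals(next.name())) {
                if (current != null) out.add(current);
                current = next;                               // a new column name begins
            } else if (next.timestamp() > current.timestamp()) {
                current = next;                               // newer version wins
            }
        }
        if (current != null) out.add(current);
        return out;
    }
}
}}}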
- 
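The digest optimisation in the last bullet (full data from the closest replica, only a hash from the others) might look like the following sketch; the DigestRead helper and the choice of MD5 are assumptions made for illustration.

{{{#!java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.List;

// One replica returns the row itself; the others return only a digest. If any
// digest disagrees with the data we received, the replicas are out of sync and
// a read repair (re-read plus write-back of the newest version) is needed.
public class DigestRead {
    static byte[] digest(String rowAsString) throws Exception {
        return MessageDigest.getInstance("MD5").digest(rowAsString.getBytes(StandardCharsets.UTF_8));
    }

    static boolean needsReadRepair(String dataFromClosestReplica, List<byte[]> digestsFromOthers) throws Exception {
        byte[] local = digest(dataFromClosestReplica);
        return digestsFromOthers.stream().anyMatch(d -> !Arrays.equals(d, local));
    }
}
}}}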
- = Gossip =
-  * based on "Efficient reconciliation and flow control for anti-entropy protocols:" http://www.cs.cornell.edu/home/rvr/papers/flowgossip.pdf
-  * See ArchitectureGossip for more details
- 
- = Failure detection =
-  * based on "The Phi accrual failure detector:" http://vsedach.googlepages.com/HDY04.pdf
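For a feel of what the cited paper computes, here is a toy phi calculation under the simplifying assumption that heartbeat inter-arrival times are exponentially distributed with the observed mean; the window size and the model are illustrative, not the detector's exact implementation.

{{{#!java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy phi-accrual calculation: phi = -log10( P(a heartbeat gap at least this long) ).
// With exponentially distributed inter-arrival times of mean m,
// P(gap > t) = exp(-t / m), so phi = (t / m) * log10(e).
public class PhiEstimator {
    private final Deque<Long> intervalsMillis = new ArrayDeque<>();
    private long lastHeartbeat = -1;

    public void heartbeat(long nowMillis) {
        if (lastHeartbeat >= 0) {
            intervalsMillis.addLast(nowMillis - lastHeartbeat);
            if (intervalsMillis.size() > 1000) intervalsMillis.removeFirst(); // sliding window
        }
        lastHeartbeat = nowMillis;
    }

    public double phi(long nowMillis) {
        if (intervalsMillis.isEmpty()) return 0.0;
        double mean = intervalsMillis.stream().mapToLong(Long::longValue).average().orElse(1.0);
        if (mean <= 0) return 0.0;
        double t = nowMillis - lastHeartbeat;
        return (t / mean) * Math.log10(Math.E); // higher phi = more confidence the node is down
    }
}
}}}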
- 
- = Further reading =
-  * The idea of dividing work into "stages" with separate thread pools comes from the famous
SEDA paper: http://www.eecs.harvard.edu/~mdw/papers/seda-sosp01.pdf
-  * Crash-only design is another broadly applied principle.  [[http://lwn.net/Articles/191059/|Valerie
Henson's LWN article]] is a good introduction
-  * Cassandra's distribution is closely related to the one presented in Amazon's Dynamo paper.
 Read repair, adjustable consistency levels, hinted handoff, and other concepts are discussed
there.  This is required background material: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html.
 The related [[http://www.allthingsdistributed.com/2008/12/eventually_consistent.html|article
on eventual consistency]] is also relevant.
-  * Cassandra's on-disk storage model is loosely based on sections 5.3 and 5.4 of [[http://labs.google.com/papers/bigtable.html|the
Bigtable paper]].
-  * Facebook's Cassandra team authored a paper on Cassandra for LADIS 09: http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf.
Most of the information there is applicable to Apache Cassandra (the main exception is the
integration of !ZooKeeper).
- 
