cassandra-commits mailing list archives

From Apache Wiki <>
Subject [Cassandra Wiki] Update of "DataModelv2" by EricEvans
Date Tue, 13 Apr 2010 23:59:16 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "DataModelv2" page has been changed by EricEvans.
The comment on this change is: some inline feedback.


  We'll start from the bottom up, moving from the leaves of Cassandra's data structure (columns)
up to the root of the tree (the cluster).
+ {{{
+ From my experience, comparing concepts to those in a relational database pretty consistently
confuses people. This section intermixes the "structure of lists and maps" approach with relational
db comparisons "ColumnFamilies can be compared to a table in a relational database.", which
is probably worse still.
+ I also think it's best to avoid referring to column families as containers.
+ }}}
  == Columns ==
  A Column is also known as a Tuple (triplet); it contains a name, a value, and a timestamp.
+ {{{
+ This wording suggests that Tuple is a synonym for Column (which is not true).
+ }}}
  All values are supplied by the client, including the 'timestamp'. This means that clocks
on the clients should be synchronized (synchronizing the Cassandra servers is also useful),
as these timestamps are used for conflict resolution. In many cases the 'timestamp' is not
used in client applications, and it becomes convenient to think of a column as a name/value
pair. For the remainder of this document, 'timestamps' will be elided for readability. It
is also worth noting that the name and value are binary values, although in many applications
they are UTF-8 serialized strings.
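As a rough illustration of the paragraph above, a column can be sketched as a name/value/timestamp triple, with the larger client-supplied timestamp winning when two versions of the same column conflict. The `Column` class and `resolve` function are hypothetical names used only for this sketch, and the handling of equal timestamps is a simplification, not a description of Cassandra's actual behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for a column: binary name, binary value, timestamp."""
    name: bytes
    value: bytes
    timestamp: int  # supplied by the client, hence the need for synchronized clocks


def resolve(a: Column, b: Column) -> Column:
    """Pick the version with the larger timestamp.

    Assumption for this sketch: on equal timestamps the first argument is kept.
    """
    if a.name != b.name:
        raise ValueError("can only reconcile two versions of the same column")
    return a if a.timestamp >= b.timestamp else b


older = Column(b"email", b"old@example.com", timestamp=1)
newer = Column(b"email", b"new@example.com", timestamp=2)
winner = resolve(older, newer)
```

If a client with a badly skewed clock wrote `older` with a timestamp of 3, its stale value would win instead, which is why the clocks matter.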
@@ -60, +70 @@

  In Cassandra, each column family is stored in a separate file, and the file is sorted in
row (i.e. key) major order. Related columns, those that you'll access together, should be
kept within the same column family.
+ {{{
+ IMO, you should avoid implementation details unless they are really relevant, as they distract
(e.g. "each column family is stored in a separate file").
+ }}}
  The row key determines which machine the data is stored on. A key can be used for several
column families at the same time; this does not, however, imply that the data from these column
families is related. The semantics of having data for the same key in two different column
families are entirely up to the client. Also, the columns can differ between the two
column families. In fact, there may be a virtually unlimited set of column names defined, which
leads to fairly common use of the column name as a piece of runtime-populated data. This is
unusual in storage systems, particularly if you're coming from the relational database world.
For each key you can have data from multiple column families associated with it. However,
these are logically distinct, which is why the Thrift interface is oriented around accessing
one !ColumnFamily per key at a time. On the other hand, a number of methods within the Thrift
interface make use of this functionality; for example, batch_insert and batch_mutate make
it possible to insert or modify data in multiple !ColumnFamilies at the same time, as long
as the key for the different column families is the same.
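The relationship between keys, column families, and columns described above can be sketched with nested maps. This is only a toy in-memory model, not the Thrift interface: `store`, `apply_batch`, and the column family names `Users` and `Followers` are assumptions made for the illustration, and timestamps are elided as in the rest of the page.

```python
from collections import defaultdict

# Toy model: column family name -> row key -> {column name: value}
store = defaultdict(lambda: defaultdict(dict))


def apply_batch(store, key, mutations):
    """Write columns into several column families under a single row key.

    mutations maps a column family name to a {column name: value} dict.
    """
    for column_family, columns in mutations.items():
        store[column_family][key].update(columns)


apply_batch(store, "jsmith", {
    # A fixed, schema-like column name.
    "Users": {"email": "jsmith@example.com"},
    # Column names used as runtime-populated data: one column per follower.
    "Followers": {"adoe": "", "bkim": ""},
})

users_row = store["Users"]["jsmith"]
followers_row = store["Followers"]["jsmith"]
```

Both rows share the key `jsmith`, but nothing in the model ties them together; whether they describe the same entity is a convention the client chooses.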
@@ -116, +130 @@

  The !SuperColumnFamily isn't much different from a normal !ColumnFamily except that it contains
a list of super columns per row instead of
  a list of columns. The following example defines a super column family in your storage-conf.xml:
+ {{{
+ IMO, the term "SuperColumnFamily" should die.
+ }}}
  An example configuration of an Authors !ColumnFamily using the UTF-8 sorting implementation
would be:
