cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Cassandra Wiki] Update of "FileFormatDesignDoc" by StuHood
Date Wed, 05 Jan 2011 09:23:29 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "FileFormatDesignDoc" page has been changed by StuHood.
The comment on this change is: Clarified the metadata discussion, added horizontal rules.
http://wiki.apache.org/cassandra/FileFormatDesignDoc?action=diff&rev1=15&rev2=16

--------------------------------------------------

   * Space efficient when un-compressed: remove redundancy
   * Random access to the middle of wide rows
   * Arbitrary nesting
+  * Range tombstones (for range/slice deletes)
  
  == Influences ==
  
   * Google Dremel [1] - Arbitrarily nested, field-oriented serialization
   * Hive RCFile [2] - Column-group-oriented storage
+ 
+ ----
  
  == Current Implementation ==
  
@@ -48, +51 @@

  
  Finally, there is a second type of redundancy that the current design does not tackle: the
column names at level "name2" are frequently repeated, but since rows are stored independently,
we don't normalize those values. For narrow rows (like those shown), removing this redundancy
will be our largest win.
  
- == High level ==
+ ==== Metadata ====
+ 
+ Metadata is currently implemented such that column parents have metadata that covers their
entire range: this means that you cannot delete arbitrary slices, only exact keys or names.
+ 
+ ----
+ 
+ == Proposed Implementation ==
  
  Because we will be storing multiple columns per SSTable, our design will bear the most similarity
to RCFile (rather than the column-per-file approach taken in Dremel). But because we allow
for nesting via super columns (and hopefully with a more flexible representation in the future),
we need to take hints from Dremel's serialization to allow for efficient storage of parent
and null information.
  
  === Vertical chunks ===
  
- Rather than slicing the span into chunks horizontally, we will use vertical chunks (and
horizontal chunks only when necessary for particularly wide rows):
+ Rather than slicing the span into chunks horizontally, we will use vertical chunks (and
break particularly wide rows into multiple spans):
  
  || ''row key'' ||
  || cheese  ||
@@ -122, +131 @@

  
  The parent change flag can be represented compactly using a bitmap, and field lengths can
be packed tightly into group-varint encoded arrays [3], as alluded to in the Dremel paper,
and mentioned in Jeff Dean's talks.
  
- === Metadata ===
- 
- Cassandra also needs to encode metadata about tuples and ranges of tuples, in order to represent
creation and deletion timestamps: range tuples can be encoded in a similar fashion to the
value tuples represented here, and the metadata timestamps can be group-varint encoded.
- 
  === Field reordering ===
  
  One weakness of the implementation so far is that it doesn't allow tuples to be reordered
within a level. This approach performs well for wide rows with high field cardinality, since
adding compression is unlikely to remove data.
@@ -154, +159 @@

  
  === Summary ===
  
- The final (simplified) representation of the span is:
+ A (simplified) representation of the span so far (without metadata) is:
  
  ''(parent-ordered)''
  || ''row key'' || ''parent_change'' ||
@@ -184, +189 @@

  || || 0 ||
  || china || 1 ||
  
+ == Metadata ==
+ 
+ Cassandra also needs to encode metadata about tuples and ranges of tuples in order to represent
creation and deletion timestamps. For both value tuples and range tuples, a varying number
(depending on value and range type) of timestamps will also need to be encoded.
+ 
+ === Range Metadata ===
+ 
+ Range tuples can be encoded in a very similar fashion to the value tuples represented above,
except that they always come in pairs. It will likely make sense to store them in a separate
blob from the value tuples, since they will bear very little similarity to one another (TODO:
need to confirm with an anecdote or two).
+ 
+ || ''name1'' - ''left'' || ''name1'' - ''right''  || ''parent_change'' ||
+ || havarti || muenster || 0 ||
+ || || || 1 ||
+ 
+ This example shows a range tombstone for values at level "name1" between 'havarti' and 'muenster':
the chunk for the "name1" level stores a pair of range tuples for the 'cheese' parent and
a nulls are stored for parents without any range metadata. The end result is that the span
stores a tombstone from ('cheese', 'havarti', <empty>) to ('cheese', 'muenster', null),
where <empty> is the smallest value, and null is the largest value.
+ 
+ Note that it is not possible for ranges for a parent to overlap: in this case, the ranges
would be resolved such that the intersection was given the winning timestamp, and the two
remainders would use their original timestamps.
+ 
+ ==== Effect of ordering ====
+ 
+ When a chunk is marked as ''self'' ordered, range metadata should be affected as well: therefore,
the number of ranges that need to be represented in a chunk should also factor into the cardinality
threshold that toggles a chunk between ''self'' and ''parent'' ordering.
+ 
+ === Value Metadata ===
+ 
+ Unlike range metadata, value (column) metadata will only be stored in the last chunk of
a span (since the entire span is used to represent the path to the column, you don't really
care about metadata until you've built the full path to the value).
+ 
+ ----
+ 
  == Schema ==
  
  To iterate quickly, the initial versions of the file format will be implemented using Avro.
The current Avro schema:

Mime
View raw message