cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Cassandra Wiki] Update of "FileFormatDesignDoc" by StuHood
Date Sun, 02 Jan 2011 01:30:17 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "FileFormatDesignDoc" page has been changed by StuHood.
http://wiki.apache.org/cassandra/FileFormatDesignDoc?action=diff&rev1=6&rev2=7

--------------------------------------------------

  
  == Requirements ==
  
- * Compression
+  * Compression
- * Space efficient when un-compressed: remove redundancy
+  * Space efficient when un-compressed: remove redundancy
- * Random access to the middle of wide rows
+  * Random access to the middle of wide rows
- * Arbitrary nesting
+  * Arbitrary nesting
  
  == Influences ==
  
- * Google Dremel - Arbitrarily nested, field-oriented serialization - http://sergey.melnix.com/pub/melnik_VLDB10.pdf
- * Hive RCFile - Column-group-oriented storage - http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-4.pdf
+  * Google Dremel [1] - Arbitrarily nested, field-oriented serialization
+  * Hive RCFile [2] - Column-group-oriented storage
  
  == Current Implementation ==
  
  To avoid overloading words too heavily, we will call a series of ordered rows in an SSTable
a 'span'. Since SSTables contain data for a single column family in sorted order, spans share
these properties as well. One way to visualize a simple span with depth 3 (aka, containing
super columns) is to arrange it as a table:
  
- || row key || name1  || name2 || value ||
+ || ''row key'' || ''name1''  || ''name2'' || ''value'' ||
  || cheese  || brie || flavor || 3.4   ||
  || cheese  || gouda || flavor  || 5.6   ||
  || cheese  || gouda || origin  || france   ||
@@ -31, +31 @@

  
  This representation of a span involves a lot of redundancy in representing whether the parent
of a particular column has changed. If we remove that particular redundancy, we get a tree
structure:
  
- || row key || name1  || name2 || value ||
+ || ''row key'' || ''name1''  || ''name2'' || ''value'' ||
  || cheese  || brie || flavor || 3.4   ||
  ||         || gouda || flavor  || 5.6   ||
  ||         ||       || origin  || france   ||
@@ -40, +40 @@

  ||         || pear  || flavor || 4.9 ||
  ||         ||       || origin || china ||
  
- The current implementation of SSTables lays data out on disk in approximately this way:
data for rows is stored contiguously, and one must seek to the "root" of the tree for a row
in order to read the row index and determine where the next level of the tree is stored. But
only the first level of the tree is indexed: in order to find a particular column at the level
labeled "name2", you would need to deserialize all columns at that level.
+ The current implementation of SSTables lays data out on disk in approximately this way:
data for rows is stored contiguously. In relation to the table representation above, we divide
the tree into pieces using horizontal "chunks". One must seek to the "root" of the tree for
a row in order to read the row index and determine which chunk the next level of the tree
is stored in.
  
+ Additionally, only the first level of the tree is indexed: in order to find a particular
column at the level labeled "name2", you would need to deserialize all columns at that level.
+ 
- Additionally, there is a second type of redundancy that we do not tackle at the moment:
the column names at level "name2" are frequently repeated, but since rows are stored independently,
we don't normalize those values. For narrow rows (like those shown), removing this redundancy
will be our largest win.
+ Finally, there is a second type of redundancy that the current design does not tackle: the
column names at level "name2" are frequently repeated, but since rows are stored independently,
we don't normalize those values. For narrow rows (like those shown), removing this redundancy
will be our largest win.
  
  == High level ==
  
- Because we will be storing multiple columns per SSTable, our design will bear the most similarity
to RCFile (rather than the column-per-file approach taken in Dremel). But because we allow
for nesting via super columns, we need to take hints from Dremel's serialization to allow
for efficient storage of parent and null information.
+ Because we will be storing multiple columns per SSTable, our design will bear the most similarity
to RCFile (rather than the column-per-file approach taken in Dremel). But because we allow
for nesting via super columns (and hopefully with a more flexible representation in the future),
we need to take hints from Dremel's serialization to allow for efficient storage of parent
and null information.
+ 
+ Rather than dividing the table representation into chunks using horizontal slices, we will
use vertical chunks (and horizontal chunks when necessary for particularly wide rows):
+ 
+ || ''row key'' ||
+ || cheese  ||
+ || fruit   ||
+ 
+ || ''name1''  ||
+ || brie ||
+ || gouda ||
+ || swiss ||
+ || apple ||
+ || pear  ||
+ 
+ || ''name2'' ||
+ || flavor ||
+ || flavor ||
+ || origin  ||
+ || flavor ||
+ || flavor ||
+ || flavor ||
+ || origin  ||
+ 
+ || ''value'' ||
+ || 3.4 ||
+ || 5.6 ||
+ || france ||
+ || 2.6 ||
+ || 4.2 ||
+ || 4.9 ||
+ || china ||
+ 
+ This representation achieves the benefits for compression shown in the RCFile paper: similar
values are always stored together. But we have lost some information!: Using the tables above,
it is impossible to determine which fields at level "name1" are cheeses, and which are fruits.
We need to store parent information, and our method should come from Dremel's clever representation
for arbitrary nesting. We add a single boolean to each tuple that toggles to represent parent
changes:
+ 
+ || ''row key'' || ''parent_change'' ||
+ || cheese  || 0 ||
+ || fruit   || 0 ||
+ 
+ || ''name1''  || ''parent_change'' ||
+ || brie || 0 ||
+ || gouda || 0 ||
+ || swiss || 0 ||
+ || apple || 1 ||
+ || pear  || 1 ||
+ 
+ || ''name2'' || ''parent_change'' ||
+ || flavor || 0 ||
+ || flavor || 1 ||
+ || origin  || 1 ||
+ || flavor || 0 ||
+ || flavor || 1 ||
+ || flavor || 0 ||
+ || origin  || 0 ||
+ 
+ || ''value'' || ''parent_change'' ||
+ || 3.4 || 0 ||
+ || 5.6 || 1 ||
+ || france || 0 ||
+ || 2.6 || 1 ||
+ || 4.2 || 0 ||
+ || 4.9 || 1 ||
+ || china || 0 ||
+ 
+ The parent change flag can be represented compactly using a bitmap, and field lengths can
be packed tightly into group-varint encoded arrays [3], as alluded to in the Dremel paper,
and mentioned in Jeff Dean's talks.
+ 
+ Cassandra also needs to encode metadata about tuples and ranges of tuples, in order to represent
creation and deletion timestamps: range tuples can be encoded in a similar fashion to the
value tuples represented here, and the metadata itself can also be group-varint encoded.
  
  == Schema ==
  
- To iterate quickly, the initial versions of the file format will be implemented using Avro.
+ To iterate quickly, the initial versions of the file format will be implemented using Avro.
The current Avro schema:
  
+ {{{#!highlight c++
+ /**
+  * The largest record in an SSTable: "chunk_count" consecutive chunks represent
+  * a span of tuples, where each chunk represents "arity" levels of nesting. The
+  * chunks in a span will have arities that sum to the total depth of the SSTable.
+  */
+ record Chunk {
+  /** Count of chunks in the current span. */
+  int chunk_count;
+  /** Levels of nesting represented by this chunk. */
+  int arity;
+ 
+  /** Tuples of size 'arity'. */
+  array<bytes> values;
+  /** Pairs of tuples of size 'arity'. */
+  array<bytes> ranges;
+ 
+  /**
+   * Packed metadata for values: only set in the last chunk of a span.
+   * type: deleted, expiring, standard
+   * parent: a single bit to represent parent changes
+   */
+  bytes value_metadata;
+  /**
+   * Packed metadata for ranges.
+   * type: deleted, null (null indicates that the range has default metadata)
+   * parent: a single bit to represent parent changes
+   */
+  bytes range_metadata;
+ 
+  /* TODO: Dumb timestamp encoding: should offset and group varint encode. */
+  /**
+   * A variable number of timestamps depending on value type: only set in the last chunk
of a span.
+   */
+  array<long> value_timestamps;
+  /**
+   * A variable number of timestamps depending on range type.
+   */
+  array<long> range_timestamps;
+ }
+ }}}
+ 
+ == References ==
+ 
+  * [1] http://sergey.melnix.com/pub/melnik_VLDB10.pdf
+  * [2] http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-4.pdf
+  * [3] https://issues.apache.org/jira/browse/AVRO-679
+ 

Mime
View raw message