cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Cassandra Wiki] Update of "FileFormatDesignDoc" by StuHood
Date Sun, 02 Jan 2011 02:33:39 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "FileFormatDesignDoc" page has been changed by StuHood.
http://wiki.apache.org/cassandra/FileFormatDesignDoc?action=diff&rev1=8&rev2=9

--------------------------------------------------

  
  Because we will be storing multiple columns per SSTable, our design will bear the most similarity
to RCFile (rather than the column-per-file approach taken in Dremel). But because we allow
for nesting via super columns (and hopefully with a more flexible representation in the future),
we need to take hints from Dremel's serialization to allow for efficient storage of parent
and null information.
  
+ === Vertical chunks ===
+ 
- Rather than dividing the table representation into chunks using horizontal slices, we will
use vertical chunks (and horizontal chunks when necessary for particularly wide rows):
+ Rather than slicing the span into chunks horizontally, we will use vertical chunks (and
horizontal chunks only when necessary for particularly wide rows):
  
  || ''row key'' ||
  || cheese  ||
@@ -83, +85 @@

  || 4.9 ||
  || china ||
  
- This representation achieves the benefits for compression shown in the RCFile paper: similar
values are always stored together. But we have lost some information!: Using the tables above,
it is impossible to determine which fields at level "name1" are cheeses, and which are fruits.
We need to store parent information, and one method comes from Dremel's clever representation
for arbitrary nesting. We add a single boolean to each tuple that toggles to represent parent
changes:
+ This representation achieves some of the benefits for compression shown in the RCFile paper:
similar values are stored much closer together. But we have lost some information!: Using
the tables above, it is impossible to determine which fields at level "name1" are cheeses,
and which are fruits.
+ 
+ === Parent representation ===
+ 
+ We need to store parent information, and one method comes from Dremel's clever representation
for arbitrary nesting. We add a single boolean to each tuple that toggles to represent parent
changes:
  
  || ''row key'' || ''parent_change'' ||
  || cheese  || 0 ||
@@ -116, +122 @@

  
  The parent change flag can be represented compactly using a bitmap, and field lengths can
be packed tightly into group-varint encoded arrays [3], as alluded to in the Dremel paper,
and mentioned in Jeff Dean's talks.
  
+ === Metadata ===
+ 
- Cassandra also needs to encode metadata about tuples and ranges of tuples, in order to represent
creation and deletion timestamps: range tuples can be encoded in a similar fashion to the
value tuples represented here, and the metadata itself can also be group-varint encoded.
+ Cassandra also needs to encode metadata about tuples and ranges of tuples, in order to represent
creation and deletion timestamps: range tuples can be encoded in a similar fashion to the
value tuples represented here, and the metadata timestamps can be group-varint encoded.
+ 
+ === Field reordering ===
+ 
+ One weakness of the implementation so far is that it doesn't allow tuples to be reordered
within a level. This approach performs well for wide rows with high field cardinality, since
adding compression is unlikely to remove data.
+ 
+ Since we have domain knowledge that a compression algorithm would not, it will often be
more efficient to perform reordering by ourselves, particularly when a chunk has low cardinality:
for example at the "name2" level above. By assigning the chunk an ''ordered'' type, we can
store the fields in sorted order (rather than in parent-sorted order) and remove duplicates.
+ 
+ || ''name2'' ||
+ || flavor ||
+ || origin  ||
+ 
+ More importantly, a chunk of type ''ordered'' should influence the order of tuples in child
chunks. When we encounter an ''ordered'' chunk at level "name2", we should expect its children
in level "value" to be arranged as follows:
+ 
+ || ''value'' || ''parent_change'' ||
+ || 3.4 || 1 ||
+ || 5.6 || 1 ||
+ || 2.6 || 1 ||
+ || 4.2 || 1 ||
+ || 4.9 || 1 ||
+ || || 0 ||
+ || france || 1 ||
+ || || 0 ||
+ || || 0 ||
+ || china || 1 ||
+ 
+ The ''parent_change'' field is now a bitmap representing nulls: it indicates that all parents
have a 'flavor' tuple, but only the second and fifth parents have an 'origin' tuple. This
representation is ripe for compression.
  
  == Schema ==
  

Mime
View raw message