cassandra-commits mailing list archives

From "Jonathan Ellis (JIRA)" <>
Subject [jira] Commented: (CASSANDRA-674) New SSTable Format
Date Tue, 12 Jan 2010 17:07:54 GMT


Jonathan Ellis commented on CASSANDRA-674:

ISTM that Slice is trying to solve the problem "how do I avoid repeating the Key/SC name w/
each column entry, now that I have moved to a global index."  This is the central difficulty
with this approach.  So, I definitely agree that we need a concept that means "all the columns
w/ the same parent" (sort of like the existing IColumnContainer) but I don't think Slice as
it exists here is the right one.  I would rather see the "things with the same parent" concept
be an iterator, with metadata from a separate file (like the current key index) used to determine
begin/end, rather than an object inside a block, several of which you (potentially) need to
assemble to get the "things with the same parent" concept.
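
To make the iterator idea concrete, here is a minimal sketch in Python (not Cassandra's Java code). Everything in it is hypothetical: the `Column` record, the in-memory `ParentIndex` standing in for the separate index file, and both helper functions. It assumes columns are stored sorted by parent, and it repeats the parent with every column rather than trying to factor it out.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple

Parent = Tuple[str, ...]  # e.g. (row key,) or (row key, supercolumn name)


@dataclass(frozen=True)
class Column:
    parent: Parent  # repeated on every column (the simple, inefficient layout)
    name: str
    value: bytes


# Hypothetical stand-in for the separate index file:
# parent -> [begin, end) positions in the data.
ParentIndex = Dict[Parent, Tuple[int, int]]


def build_index(columns: List[Column]) -> ParentIndex:
    """Record the [begin, end) range of each run of columns sharing a parent.

    Assumes the columns are already sorted by parent, so each parent
    occupies one contiguous run.
    """
    index: ParentIndex = {}
    for pos, col in enumerate(columns):
        begin, _ = index.get(col.parent, (pos, pos))
        index[col.parent] = (begin, pos + 1)
    return index


def columns_for_parent(
    data: List[Column], index: ParentIndex, parent: Parent
) -> Iterator[Column]:
    """Iterate over all columns with the given parent, using only the index
    to find where the run begins and ends."""
    begin, end = index.get(parent, (0, 0))
    for pos in range(begin, end):
        yield data[pos]
```

The "things with the same parent" concept exists here only as an iterator plus index metadata; nothing in the data itself groups the columns.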

I also think that if I were doing this myself I would probably make part 1 be a conversion
to the global index and just inefficiently repeat the Key/SC data, and then try to make it
efficient with the Slice/iterator-thing next.  But that is just a first impression I am throwing
out fwiw. :)

> New SSTable Format
> ------------------
>                 Key: CASSANDRA-674
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.9
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.9
>         Attachments: 674-v1.diff, perf-674-v1.txt, perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt
> Various tickets exist due to limitations in the SSTable file format, including #16, #47
> and #328. Attached is a proposed design/implementation of a new file format for SSTables that
> addresses a few of these limitations. The implementation has a bunch of issues/fixmes, which
> I'll describe in the comments.
> The file format is described in the javadoc for the class, but
>  * Blocks are opaque (except for their header) so that they can be compressed. The index
>    file contains an entry for the first key in every Block. Blocks contain Slices.
>  * Slices are series of columns with the same parents and (deletion) metadata. They can
>    be used to represent ColumnFamilies or SuperColumns (or a slice of columns at any other depth).
>    A single CF can be split across multiple Slices, which can be split across multiple blocks.
>  * Neither Slices nor Blocks have a fixed size or maximum length, but they each have
>    target lengths which can be stretched and broken by very large columns.
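
As an illustration of how an index holding the first key of every Block can be searched, here is a minimal Python sketch (not the patch's Java code). The `find_block` helper and the tuple representation of keys as `(row key, column name)` are assumptions made for the example; the only idea taken from the description is that the index stores one entry per Block, in sorted order.

```python
import bisect
from typing import List, Tuple

Key = Tuple[str, ...]  # hypothetical key shape: (row key, column name, ...)


def find_block(first_keys: List[Key], target: Key) -> int:
    """Return the position of the Block that could contain `target`.

    `first_keys` holds the first key of every Block, in sorted order.
    The candidate Block is the last one whose first key is <= target.
    Returns -1 if `target` sorts before the first Block.
    """
    return bisect.bisect_right(first_keys, target) - 1
```

With `first_keys = [("apple", "a"), ("apple", "m"), ("cherry", "a")]`, the row `apple` spans the first two Blocks, and a lookup for `("apple", "p")` goes straight to the second Block without reading the first.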
> The most interesting concepts from this patch are:
>  * Block compression is possible (currently using GZIP, which has one bug mentioned in
>    the comments),
>  * Compaction involves merging intersecting Slices from input SSTables. Since large rows
>    will be broken down into multiple slices, only the portions of rows that intersect between
>    tables need to be deserialized/merged/held-in-memory,
>  * Indexes for individual rows are gone, since the global index allows random access
>    to the middle of column families that span Blocks, and Slices allow batches of columns to
>    be skipped within a Block.
>  * Bloom filters for individual rows are gone, and the global filter contains ColumnKeys
>    instead, meaning that a query for a column that doesn't exist in a row that does will often
>    not need to seek to the row.
>  * Metadata (deletion/gc time) and ColumnKeys (key, colname1, colname2...) for columns
>    are defined recursively, so deeply nested slices are possible,
>  * Slices representing a single parent (CF, SC, etc) can have different Metadata, meaning
>    that a tombstone Slice from d-f could sit between Slices containing columns a-c and g-h. This
>    allows for eventually consistent range deletes of columns.
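
The tombstone-Slice idea in the last bullet can be sketched roughly as follows, in Python rather than the patch's Java. The `Slice` fields, the `live_columns` helper, and the rule that a tombstone wins when timestamps tie are all assumptions made for this illustration, not details taken from the patch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Slice:
    parent: Tuple[str, ...]            # shared parent, e.g. (row key,)
    first: str                         # first column name covered (inclusive)
    last: str                          # last column name covered (inclusive)
    deleted_at: Optional[int] = None   # slice-level deletion time, None if live
    columns: Dict[str, int] = field(default_factory=dict)  # name -> write time


def live_columns(slices: List[Slice]) -> List[str]:
    """Return names of columns not shadowed by a tombstone Slice.

    A column is shadowed if some tombstone Slice with the same parent covers
    its name and was written at or after the column (assumed tie-break).
    """
    tombstones = [s for s in slices if s.deleted_at is not None]
    live = []
    for s in slices:
        for name, written_at in s.columns.items():
            shadowed = any(
                t.parent == s.parent
                and t.first <= name <= t.last
                and t.deleted_at >= written_at
                for t in tombstones
            )
            if not shadowed:
                live.append(name)
    return sorted(live)
```

A tombstone Slice covering d-f removes older columns in that range from any input, while columns written after the deletion, and columns in the neighbouring a-c and g-h Slices, survive.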

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
