cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sylvain Lebresne (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout
Date Mon, 01 Sep 2014 09:25:22 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117240#comment-14117240
] 

Sylvain Lebresne commented on CASSANDRA-7447:
---------------------------------------------

bq. you seem to have brushed over my point about delivering this project incrementally

I'm completely on board about doing it incrementally. I'm just not sure what is intrinsically
less incremental in what I'm suggesting. We both agree that we want a modular file-format,
we just disagree on what components we should start with. I'm saying we should start with
generic components with no more ambition than to be no less efficient than what we have currently
(but let's be clear that it's not very hard to improve on at least: avoiding repeating clustering
columns, using a compact encoding for CQL column names, and use offset-encoding for timestamps,
all of which are part of what you are suggesting for the first iteration anyway). You want
to start with specialized components that are somewhat optimal for some workloads. I claim
that there is a lot more to debate in what you're suggesting and I certainly refute the claim
that what I'm suggesting is less digestible and testable (I would argue to the contrary in
fact).

bq. It also happens to allow us to guarantee delivery of high quality functionality in time
for 3.0, even if it's possibly only a subset (though a good chance we can hit all of it),
instead of delivering fairly average functionality that covers all users

I do prefer delivering moderate improvements for all cases first and risking that the bigger
but more restricted improvement slip by, because this also mean reducing the risk of having
to maintain the old format longer (and as said above, I do claim it's more digestible/incremental
in practice). In fact, your comments so far have strongly suggested that having the file format
cover everything was not even your second step.

bq. However even if the goal of eliminating the legacy format is decided upon, it seems best
to leave this until much nearer GA as if we can deliver the full featureset before GA this
is simply wasted effort.

And to be clear, I respectfully disagree (that it's best, but I also don't see how it's wasted
effort). I ask that we collectively decide both on 1) whether or not eliminating the old format
is a goal (I'm very very strongly opposed to this not being a goal at all personally) and
2) whether or not we should start by covering everything first and optimize later for some
cases versus starting by covering specific subsets first and add the more general part second.

> New sstable format with support for columnar layout
> ---------------------------------------------------
>
>                 Key: CASSANDRA-7447
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7447
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>            Assignee: Benedict
>              Labels: performance, storage
>             Fix For: 3.0
>
>         Attachments: ngcc-storage.odp
>
>
> h2. Storage Format Proposal
> C* has come a long way over the past few years, and unfortunately our storage format
hasn't kept pace with the data models we are now encouraging people to utilise. This ticket
proposes a collections of storage primitives that can be combined to serve these data models
more optimally.
> It would probably help to first state the data model at the most abstract level. We have
a fixed three-tier structure: We have the partition key, the clustering columns, and the data
columns. Each have their own characteristics and so require their own specialised treatment.
> I should note that these changes will necessarily be delivered in stages, and that we
will be making some assumptions about what the most useful features to support initially will
be. Any features not supported will require sticking with the old format until we extend support
to all C* functionality.
> h3. Partition Key
> * This really has two components: the partition, and the value. Although the partition
is primarily used to distribute across nodes, it can also be used to optimise lookups for
a given key within a node
> * Generally partitioning is by hash, and for the moment I want to focus this ticket on
the assumption that this is the case
> * Given this, it makes sense to optimise our storage format to permit O(1) searching
of a given partition. It may be possible to achieve this with little overhead based on the
fact we store the hashes in order and know they are approximately randomly distributed, as
this effectively forms an immutable contiguous split-ordered list (see Shalev/Shavit, or CASSANDRA-7282),
so we only need to store an amount of data based on how imperfectly distributed the hashes
are, or at worst a single value per block.
> * This should completely obviate the need for a separate key-cache, which will be relegated
to supporting the old storage format only
> h3. Primary Key / Clustering Columns
> * Given we have a hierarchical data model, I propose the use of a cache-oblivious trie
> * The main advantage of the trie is that it is extremely compact and _supports optimally
efficient merges with other tries_ so that we can support more efficient reads when multiple
sstables are touched
> * The trie will be preceded by a small amount of related data; the full partition key,
a timestamp epoch (for offset-encoding timestamps) and any other partition level optimisation
data, such as (potentially) a min/max timestamp to abort merges earlier
> * Initially I propose to limit the trie to byte-order comparable data types only (the
number of which we can expand through translations of the important types that are not currently)
> * Crucially the trie will also encapsulate any range tombstones, so that these are merged
early in the process and avoids re-iterating the same data
> * Results in true bidirectional streaming without having to read entire range into memory
> h3. Values
> There are generally two approaches to storing rows of data: columnar, or row-oriented.
The above two data structures can be combined with a value storage scheme that is based on
either. However, given the current model we have of reading large 64Kb blocks for any read,
I am inclined to focus on columnar support first, as this delivers order-of-magnitude benefits
to those users with the correct workload, while for most workloads our 64Kb blocks are large
enough to store row-oriented data in a column-oriented fashion without any performance degradation
(I'm happy to consign very large row support to phase 2). 
> Since we will most likely target both behaviours eventually, I am currently inclined
to suggest that static columns, sets and maps be targeted for a row-oriented release, as they
don't naturally fit in a columnar layout without secondary heap-blocks. This may be easier
than delivering heap-blocks also, as it keeps both implementations relatively clean. This
is certainly open to debate, and I have no doubt there will be conflicting opinions here.
> Focusing on our columnar layout, the goals are:
> * Support sparse and dense column storage
> * Efficient compression of tombstones, timestamps, ttls, etc. into near-zero space based
on offset/delta/bitmap encoding
> * Normalisation of column names once per file
> * Per-file block-layout index, defining how each block's data is encoded, so we can index
directly within a block for dense fields (permitting more efficient page cache utilisation)
> * Configurable grouping of fields per block, so that we can get closer to row-oriented
or column-oriented performance, based on user goals
> I have attached my slides from the ngcc for reference.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message