Return-Path: X-Original-To: apmail-cassandra-commits-archive@www.apache.org Delivered-To: apmail-cassandra-commits-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 318F011ECA for ; Fri, 8 Aug 2014 11:23:22 +0000 (UTC) Received: (qmail 53230 invoked by uid 500); 8 Aug 2014 11:23:17 -0000 Delivered-To: apmail-cassandra-commits-archive@cassandra.apache.org Received: (qmail 53198 invoked by uid 500); 8 Aug 2014 11:23:17 -0000 Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@cassandra.apache.org Delivered-To: mailing list commits@cassandra.apache.org Received: (qmail 53176 invoked by uid 99); 8 Aug 2014 11:23:17 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 08 Aug 2014 11:23:17 +0000 Date: Fri, 8 Aug 2014 11:23:16 +0000 (UTC) From: "Benedict (JIRA)" To: commits@cassandra.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090655#comment-14090655 ] Benedict commented on CASSANDRA-7447: ------------------------------------- bq. CASSANDRA-7443 is good to clean up the code base and should be done first This, and further related patches, are a necessary prerequisite to this ticket, yes. bq. Using 32bit instead of 64bit pointers could also save some space. I would prefer not to go down this route just yet, as it is error prone to be optimising this in the first version. Any optimisations that can be made universally (i.e. guaranteed to be safe for all file sizes) I'm onboard with, but obfuscating code dependent on file size I'm not. Especially as this introduces an extra condition to execute on every single field access, potentially stalling the processor pipeline more readily. bq. Trie + byte ordered types: would this mean to do some "special serialization" e.g. for timeuuid to make them "binary comparable"? Yes bq. If one partition only contains one row, "plain" row-oriented storage seems to be more efficient. Is this what "small partition layout" is meant for? No, it is because it requires fewer disk accesses to have it all packed into the same block (or we can have smaller blocks, increasing IOPS esp. on SSDs). In fact it is quite reasonable to assume that even with single row partitions the column oriented storage will be more efficient, as the columns do not care about partitions; they extend across all partitions, and so the serialization costs are reduced even if there are no clustering columns. I should note that the presentation at ngcc is only for historical reference and to get familiar with the general discussion. As mentioned in the description of this ticket, I now favour a row-oriented approach backed by the new index structures for many of the non-optimal column-oriented use cases, which *may* reduce the necessity of a compact column-oriented form, although it would still be useful as just described. bq. Column names (CQL): I'd prefer to extend the table definition schema with an integer column id and use that. Could save lots of String.hashCode/equals() calls - even if the column-id is also used in native protocol. (Think this was discussed elsewhere) There is a separate ticket for this, and I consider it to be an orthogonal effort. We can more easily deliver it here than we can cross-cluster (personally I favour cross-cluster names to be supported by a general enum type (CASSANDRA-6917)) bq. Bikeshed: Is the term "sstable" still correct? The original sstable was only imposing a sort-order on the partition keys. This will still be imposed, so yes, but I don't have any strong attachment to it. bq. I didn't catch the point why only maps and sets don't naturally fit into columnar format but lists, strings and blobs do. Or is it just because of their mean serialized size? They don't logically fit because they are an extra dimension, much as static columns are one _fewer_ dimension. Columnar layouts really need fixed dimensionality. You can flatten maps, sets and lists (my list was not exhaustive), but this incurs significant cost and complexity on reading these across multiple sstables, as opposed to relying on the standard machinery. Strings and blobs can more trivially be split out into an extra file if they are too large (for simplicity of first delivery we can just append all values larger than some limit to a file, and replace them with their location in the file), but storing large strings in a columnar layout is generally not sensible/beneficial anyway. In all likelihood I think the best approach may be to permit collections and statics on column oriented tables by splitting them into a separate row-oriented sstable, at least in the near-term. The heap-blocks outlined in the ngcc talk could be delivered later, although I might be inclined to tell users that column oriented storage is not for them if they want to store these things in the table. > New sstable format with support for columnar layout > --------------------------------------------------- > > Key: CASSANDRA-7447 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7447 > Project: Cassandra > Issue Type: Improvement > Components: Core > Reporter: Benedict > Assignee: Benedict > Labels: performance, storage > Fix For: 3.0 > > Attachments: ngcc-storage.odp > > > h2. Storage Format Proposal > C* has come a long way over the past few years, and unfortunately our storage format hasn't kept pace with the data models we are now encouraging people to utilise. This ticket proposes a collections of storage primitives that can be combined to serve these data models more optimally. > It would probably help to first state the data model at the most abstract level. We have a fixed three-tier structure: We have the partition key, the clustering columns, and the data columns. Each have their own characteristics and so require their own specialised treatment. > I should note that these changes will necessarily be delivered in stages, and that we will be making some assumptions about what the most useful features to support initially will be. Any features not supported will require sticking with the old format until we extend support to all C* functionality. > h3. Partition Key > * This really has two components: the partition, and the value. Although the partition is primarily used to distribute across nodes, it can also be used to optimise lookups for a given key within a node > * Generally partitioning is by hash, and for the moment I want to focus this ticket on the assumption that this is the case > * Given this, it makes sense to optimise our storage format to permit O(1) searching of a given partition. It may be possible to achieve this with little overhead based on the fact we store the hashes in order and know they are approximately randomly distributed, as this effectively forms an immutable contiguous split-ordered list (see Shalev/Shavit, or CASSANDRA-7282), so we only need to store an amount of data based on how imperfectly distributed the hashes are, or at worst a single value per block. > * This should completely obviate the need for a separate key-cache, which will be relegated to supporting the old storage format only > h3. Primary Key / Clustering Columns > * Given we have a hierarchical data model, I propose the use of a cache-oblivious trie > * The main advantage of the trie is that it is extremely compact and _supports optimally efficient merges with other tries_ so that we can support more efficient reads when multiple sstables are touched > * The trie will be preceded by a small amount of related data; the full partition key, a timestamp epoch (for offset-encoding timestamps) and any other partition level optimisation data, such as (potentially) a min/max timestamp to abort merges earlier > * Initially I propose to limit the trie to byte-order comparable data types only (the number of which we can expand through translations of the important types that are not currently) > * Crucially the trie will also encapsulate any range tombstones, so that these are merged early in the process and avoids re-iterating the same data > * Results in true bidirectional streaming without having to read entire range into memory > h3. Values > There are generally two approaches to storing rows of data: columnar, or row-oriented. The above two data structures can be combined with a value storage scheme that is based on either. However, given the current model we have of reading large 64Kb blocks for any read, I am inclined to focus on columnar support first, as this delivers order-of-magnitude benefits to those users with the correct workload, while for most workloads our 64Kb blocks are large enough to store row-oriented data in a column-oriented fashion without any performance degradation (I'm happy to consign very large row support to phase 2). > Since we will most likely target both behaviours eventually, I am currently inclined to suggest that static columns, sets and maps be targeted for a row-oriented release, as they don't naturally fit in a columnar layout without secondary heap-blocks. This may be easier than delivering heap-blocks also, as it keeps both implementations relatively clean. This is certainly open to debate, and I have no doubt there will be conflicting opinions here. > Focusing on our columnar layout, the goals are: > * Support sparse and dense column storage > * Efficient compression of tombstones, timestamps, ttls, etc. into near-zero space based on offset/delta/bitmap encoding > * Normalisation of column names once per file > * Per-file block-layout index, defining how each block's data is encoded, so we can index directly within a block for dense fields (permitting more efficient page cache utilisation) > * Configurable grouping of fields per block, so that we can get closer to row-oriented or column-oriented performance, based on user goals > I have attached my slides from the ngcc for reference. -- This message was sent by Atlassian JIRA (v6.2#6252)