Return-Path:
Delivered-To: apmail-incubator-cassandra-commits-archive@minotaur.apache.org
Received: (qmail 95806 invoked from network); 6 Jan 2010 08:12:18 -0000
Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 6 Jan 2010 08:12:18 -0000
Received: (qmail 26453 invoked by uid 500); 6 Jan 2010 08:12:18 -0000
Delivered-To: apmail-incubator-cassandra-commits-archive@incubator.apache.org
Received: (qmail 26421 invoked by uid 500); 6 Jan 2010 08:12:18 -0000
Mailing-List: contact cassandra-commits-help@incubator.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: cassandra-dev@incubator.apache.org
Delivered-To: mailing list cassandra-commits@incubator.apache.org
Received: (qmail 26411 invoked by uid 99); 6 Jan 2010 08:12:18 -0000
Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 06 Jan 2010 08:12:18 +0000
X-ASF-Spam-Status: No, hits=-2000.0 required=10.0 tests=ALL_TRUSTED
X-Spam-Check-By: apache.org
Received: from [140.211.11.140] (HELO brutus.apache.org) (140.211.11.140) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 06 Jan 2010 08:12:16 +0000
Received: from brutus.apache.org (localhost [127.0.0.1]) by brutus.apache.org (Postfix) with ESMTP id 0EC0C234C052 for ; Wed, 6 Jan 2010 00:11:55 -0800 (PST)
Message-ID: <403490776.63651262765515059.JavaMail.jira@brutus.apache.org>
Date: Wed, 6 Jan 2010 08:11:55 +0000 (UTC)
From: "Stu Hood (JIRA)"
To: cassandra-commits@incubator.apache.org
Subject: [jira] Commented: (CASSANDRA-674) New SSTable Format
In-Reply-To: <920995237.63391262763234729.JavaMail.jira@brutus.apache.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394
X-Virus-Checked: Checked by ClamAV on apache.org

[ 
https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797025#action_12797025 ]

Stu Hood commented on CASSANDRA-674:
------------------------------------

List of features stubbed as "FIXME: not implemented" in v1:
1. Reverse slicing within CFs is not implemented (see SSTableSliceIterator),
2. Reading SuperColumns is disabled (see SSTable(Slice|Names)Iterator),
3. The recently added MMAP support for data files is disabled until I can port this SSTableScanner interface to use it (see SSTableReader),
4. AntiEntropyService is not hashing slices (meaning that major compactions always fail),
5. SSTable(Import|Export) are broken,
6. BinaryMemtables will crash on flush,
7. The bytesRead MBean for CompactionManager is disabled,
8. AntiCompaction is not using the 'skip ranges we don't need' optimization.

Also, I lied in the description above: the patch does not have GZIP compression enabled, but you can add two lines to enable it: add a GZIPInputStream to the chain in SSTableReader.Block.stream(), and a GZIPOutputStream to the chain in SSTableWriter.BlockContext.flushSlice(). There is a memory leak related to reading from compressed blocks which will quickly kill the server, but it should be easy to track down.

Finally, there are tons of other TODOs/FIXMEs scattered around, many of which should be tackled in other tickets.

> New SSTable Format
> ------------------
>
> Key: CASSANDRA-674
> URL: https://issues.apache.org/jira/browse/CASSANDRA-674
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Affects Versions: 0.9
> Reporter: Stu Hood
> Assignee: Stu Hood
> Fix For: 0.9
>
> Attachments: 674-v1.diff
>
>
> Various tickets exist due to limitations in the SSTable file format, including #16, #47 and #328. Attached is a proposed design/implementation of a new file format for SSTables that addresses a few of these limitations.
> The implementation has a bunch of issues/fixmes, which I'll describe in the comments.
> The file format is described in the javadoc for the o.a.c.io.SSTableWriter class, but briefly:
> * Blocks are opaque (except for their header) so that they can be compressed. The index file contains an entry for the first key in every Block. Blocks contain Slices.
> * Slices are series of columns with the same parents and (deletion) metadata. They can be used to represent ColumnFamilies or SuperColumns (or a slice of columns at any other depth). A single CF can be split across multiple Slices, which can be split across multiple blocks.
> * Neither Slices nor Blocks have a fixed size or maximum length, but they each have target lengths which can be stretched and broken by very large columns.
> The most interesting concepts from this patch are:
> * Block compression is possible (currently using GZIP, which has one bug mentioned in the comments),
> * Compaction involves merging intersecting Slices from input SSTables. Since large rows will be broken down into multiple slices, only the portions of rows that intersect between tables need to be deserialized/merged/held-in-memory,
> * Indexes for individual rows are gone, since the global index allows random access to the middle of column families that span Blocks, and Slices allow batches of columns to be skipped within a Block.
> * Bloom filters for individual rows are gone, and the global filter contains ColumnKeys instead, meaning that a query for a column that doesn't exist in a row that does will often not need to seek to the row.
> * Metadata (deletion/gc time) and ColumnKeys (key, colname1, colname2...) for columns are defined recursively, so deeply nested slices are possible,
> * Slices representing a single parent (CF, SC, etc) can have different Metadata, meaning that a tombstone Slice from d-f could sit between Slices containing columns a-c and g-h. This allows for eventually consistent range deletes of columns.
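
As a rough illustration of the last point in the quoted description (Slices under one parent carrying different Metadata), the following is a minimal Java sketch of how a tombstone Slice covering d-f could shadow older columns in that range. It is not taken from 674-v1.diff: the names SliceSketch, Slice, deletedAt and isLive are hypothetical, column names are assumed to compare as plain strings, and a column is assumed to be shadowed whenever a covering Slice has a deletion timestamp at least as new as the column's write timestamp.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch only; these are not the classes from the patch.
final class SliceSketch {
    static final long NOT_DELETED = Long.MIN_VALUE;

    /** A run of columns under one parent that share the same deletion metadata. */
    static final class Slice {
        final String first;               // first column name covered (inclusive)
        final String last;                // last column name covered (inclusive)
        final long deletedAt;             // deletion timestamp, or NOT_DELETED
        final Map<String, Long> columns;  // column name -> write timestamp

        Slice(String first, String last, long deletedAt, Map<String, Long> columns) {
            this.first = first;
            this.last = last;
            this.deletedAt = deletedAt;
            this.columns = columns;
        }

        boolean covers(String name) {
            return first.compareTo(name) <= 0 && name.compareTo(last) <= 0;
        }
    }

    /** A column is live if it was written after the newest deletion covering its name. */
    static boolean isLive(List<Slice> slices, String name) {
        long newestWrite = NOT_DELETED;
        long newestDelete = NOT_DELETED;
        boolean found = false;
        for (Slice s : slices) {
            if (!s.covers(name)) {
                continue; // slices that do not intersect the name can be skipped
            }
            newestDelete = Math.max(newestDelete, s.deletedAt);
            Long ts = s.columns.get(name);
            if (ts != null) {
                found = true;
                newestWrite = Math.max(newestWrite, ts);
            }
        }
        return found && newestWrite > newestDelete;
    }

    public static void main(String[] args) {
        List<Slice> slices = List.of(
            new Slice("a", "c", NOT_DELETED, Map.of("a", 1L, "b", 1L, "c", 1L)),
            new Slice("d", "f", 10L, Map.of()),                  // tombstone Slice for d-f
            new Slice("d", "f", NOT_DELETED, Map.of("e", 5L)),   // older data from another SSTable
            new Slice("g", "h", NOT_DELETED, Map.of("g", 1L, "h", 1L)));
        System.out.println(isLive(slices, "b")); // prints true
        System.out.println(isLive(slices, "e")); // prints false
    }
}
```

In this sketch only the Slices whose range intersects the requested name are consulted, which mirrors the idea that just the intersecting portions of a row need to be merged.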
-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.