Date: Tue, 10 May 2011 16:17:03 +0200
Subject: Re: column bloat
From: Sylvain Lebresne
To: user@cassandra.apache.org

On Tue, May 10, 2011 at 3:44 PM, Terje Marthinussen wrote:
> Hi,
> If you make a supercolumn today,
> what you end up with is:
> - short + "Super Column name"
> - int (local deletion time)
> - long (delete time)
> Byte array of columns each with:
>   - short + "column name"
>   - int (TTL)
>   - int (local deletion time)
>   - long (timestamp)
>   - int + "value of column"

Almost but not exactly. First, there is a 1-byte flag that is used to know
whether the column is a tombstone or an expiring one. Second, for tombstones
the local deletion time is actually stored as part of the value, so we don't
have that 'int (local deletion time)'. Third, we only serialize the TTL for
expiring columns, and in that case we serialize two ints (the TTL and the
local expiration time, which is maybe the one you called local deletion time
above).
Anyway, to sum that up, compared to your count expiring columns are 1 byte
more and non-expiring ones are 7 bytes less. Not arguing, it's still fairly
verbose, especially with tons of very small columns.

> That is, meta data and serialization overhead adds up to:
> 2+4+8 = 14 bytes for the supercolumn
> 2+4+4+8+4 = 22 bytes for each column the supercolumn has
> Yes, disk space is cheap and all that, but trying to handle a few billion
> supercolumns which each have some 30-50 subcolumns, I am looking at some
> 1.2-1.5TB of meta data, which makes the metadata by itself some 3-4 times the
> original data. That does seem a bit excessive when you also throw in RF=3 and
> the requirement for extra disk space to safely survive compactions.
> And yes, this is without considering the overhead of column names.
> I can see a handful of ways to reduce this quite a bit, for instance by:
> - not adding TTL/deletion time if not needed (some compact bitmap structure
> to turn on/off fields?)

As said, we already do that.

> - inherit timestamps from the supercolumn

Columns inside a supercolumn have no reason to share the same timestamp
(or even close ones for that matter).
But maybe you're talking about something more subtle, in which case yes,
there are ways to compress the data.

> There may also be some interesting ways to compress this data assuming that
> the timestamps are generally in the same time areas (shared "prefixes"
> for instance), but that gets a bit more complex.
> Any opinions or plans?

There has been a lot of discussion around this. The first ticket to
tackle this is actually pretty old:
https://issues.apache.org/jira/browse/CASSANDRA-47 (but you do know about
this ticket).
There's also talk about completely rewriting the file format
(https://issues.apache.org/jira//browse/CASSANDRA-674).
I don't take a lot of risk in saying that at least some form of compression
will happen some day. I wouldn't go as far as giving you a date though.

--
Sylvain

> Sorry, I could not find any JIRAs on the topic, but I guess I am not
> surprised if they exist.
> Yes, I could serialize this myself outside of Cassandra, but that would sort
> of defeat the purpose of using a more advanced storage system like
> Cassandra.
> Regards,
> Terje
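
For reference, the byte accounting discussed in this thread can be written out as a small sketch. It only mirrors the field sizes named in the messages above; it is not taken from the Cassandra source. The function names are hypothetical, the 14-byte supercolumn header is Terje's figure (the reply does not dispute it), and the tombstone case is left out.

```python
# Field widths in bytes, as named in the thread.
SHORT, INT, LONG, FLAG = 2, 4, 8, 1

# Terje's original per-column estimate:
# name length + TTL + local deletion time + timestamp + value length.
ORIGINAL_ESTIMATE = SHORT + INT + INT + LONG + INT  # 22


def column_overhead(expiring: bool) -> int:
    """Per-column metadata bytes, excluding the name and value payloads.

    Follows the corrected description in the reply: a 1-byte flag is always
    present, and the TTL plus local expiration time are only written for
    expiring columns. (Hypothetical helper, not a Cassandra API.)
    """
    size = SHORT + FLAG + LONG + INT  # name length, flag, timestamp, value length
    if expiring:
        size += INT + INT  # TTL and local expiration time
    return size


def supercolumn_overhead(subcolumns: int, expiring: bool = False) -> int:
    """Metadata bytes for one supercolumn with the given number of subcolumns.

    Assumes the 14-byte supercolumn header from the original message.
    """
    header = SHORT + INT + LONG  # name length, local deletion time, delete time
    return header + subcolumns * column_overhead(expiring)


if __name__ == "__main__":
    print(column_overhead(False) - ORIGINAL_ESTIMATE)  # -7
    print(column_overhead(True) - ORIGINAL_ESTIMATE)   # 1
    print(supercolumn_overhead(40))                    # 614
```

Under these assumptions a non-expiring column carries 15 bytes of metadata and an expiring one 23, which matches the "7 bytes less" and "1 byte more" figures in the reply.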