Date: Tue, 10 May 2011 16:17:03 +0200
Subject: Re: column bloat
From: Sylvain Lebresne <sylvain@datastax.com>
To: user@cassandra.apache.org

On Tue, May 10, 2011 at 3:44 PM, Terje Marthinussen
<tmarthinussen@gmail.com> wrote:
> Hi,
> If you make a supercolumn today, what you end up with is:
> - short + "Super Column name"
> - int (local deletion time)
> - long (delete time)
> Byte array of columns each with:
>   - short + "column name"
>   - int (TTL)
>   - int (local deletion time)
>   - long (timestamp)
>   - int + "value of column"

Almost but not exactly. First, there is a 1-byte flag that is used to
know if the column is a tombstone or an expiring one. Second, for
tombstones the local deletion time is actually stored as part of the
value, so we don't have that 'int (local deletion time)'. Third, we
only serialize the TTL for expiring columns, and in that case we
serialize two ints (the TTL and the local expiration time, which is
maybe the one you called local deletion time above).

Anyway, to sum that up, expiring columns are 1 byte more than your
estimate and non-expiring ones are 7 bytes less. Not arguing, it's
still fairly verbose, especially with tons of very small columns.
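
To make the arithmetic concrete, here is a rough sketch in Python that adds up the fields described above. It is only a back-of-the-envelope calculator, not the actual serializer: the constant and function names are made up for illustration, the 14 bytes for the supercolumn are taken from the quoted list as-is, and name and value bytes are deliberately left out.

```python
# Per-column metadata in bytes, excluding the name and value bytes themselves.
# Field sizes follow the description above; all names here are illustrative.
FLAG = 1               # 1-byte flag (tombstone / expiring marker)
NAME_LEN = 2           # short: length prefix of the column name
TIMESTAMP = 8          # long
VALUE_LEN = 4          # int: length prefix of the value
TTL = 4                # int, expiring columns only
LOCAL_EXPIRATION = 4   # int, expiring columns only

# Supercolumn header as counted in the quoted mail: 2 + 4 + 8.
SUPER_COLUMN = 2 + 4 + 8

# Per-column figure from the quoted mail: 2 + 4 + 4 + 8 + 4.
QUOTED_ESTIMATE = 2 + 4 + 4 + 8 + 4


def column_overhead(expiring: bool = False) -> int:
    """Metadata bytes for one column, not counting name or value bytes."""
    size = FLAG + NAME_LEN + TIMESTAMP + VALUE_LEN
    if expiring:
        size += TTL + LOCAL_EXPIRATION
    return size


def supercolumn_overhead(subcolumns: int, expiring: bool = False) -> int:
    """Metadata bytes for one supercolumn plus all of its subcolumns."""
    return SUPER_COLUMN + subcolumns * column_overhead(expiring)


if __name__ == "__main__":
    print(column_overhead())                                   # 15
    print(column_overhead(expiring=True))                      # 23
    print(column_overhead() - QUOTED_ESTIMATE)                 # -7
    print(column_overhead(expiring=True) - QUOTED_ESTIMATE)    # 1
    print(supercolumn_overhead(40))                            # 614
```

For a tombstone the overhead is the same 15 bytes as a regular column, with the local deletion time carried in the value rather than in a separate field.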

> That is, meta data and serialization overhead adds up to:
> 2+4+8 = 14 bytes for the supercolumn
> 2+4+4+8+4 = 22 bytes for each column the supercolumn has
> Yes, disk space is cheap and all that, but trying to handle a few billion
> supercolumns which each have some 30-50 subcolumns, I am looking at some
> 1.2-1.5TB of meta data which makes the metadata by itself some 3-4 times the
> original data. That does seem a bit excessive when you also throw in RF=3 and
> the requirement for extra diskspace to safely survive compactions.
> And yes, this is without considering the overhead of column names.
> I can see a handful of ways to reduce this quite a bit, for instance by:
> - not adding TTL/deletion time if not needed (some compact bitmap structure
> to turn on/off fields?)

As said, we already do that.

> - inherit timestamps from the supercolumn

Columns inside a supercolumn have no reason to share the same timestamp (or
even close ones for that matter). But maybe you're talking about something
more subtle, in which case yes, there are ways to compress the data.
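
As one illustration of what such compression could look like when the timestamps in a supercolumn happen to be close together, here is a small Python sketch that stores the first timestamp in full and the rest as variable-length deltas from it. This is purely hypothetical and not something Cassandra does; the function names and the sample timestamps are invented for the example, and the varint/zigzag scheme is borrowed from the usual base-128 encoding.

```python
from typing import List


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (low 7 bits first)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def zigzag(n: int) -> int:
    """Map a signed integer to an unsigned one so small negatives stay small."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def encode_timestamps(timestamps: List[int]) -> bytes:
    """First timestamp as a full 8-byte long, the others as varint deltas from it."""
    if not timestamps:
        return b""
    base = timestamps[0]
    out = bytearray(base.to_bytes(8, "big", signed=True))
    for ts in timestamps[1:]:
        out += encode_varint(zigzag(ts - base))
    return bytes(out)


if __name__ == "__main__":
    # Hypothetical microsecond timestamps written within a couple of seconds.
    base = 1305036000000000
    stamps = [base, base + 1500, base - 200, base + 2000000]
    print(len(stamps) * 8)                 # 32 bytes as plain longs
    print(len(encode_timestamps(stamps)))  # 16 bytes with deltas
```

The gain disappears if the timestamps are far apart, since a large delta needs as many bytes as the long it replaces (or more).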

> There may also be some interesting ways to compress this data assuming that
> the timestamps are generally in the same time areas (shared "prefixes"
> for instance), but that gets a bit more complex.
> Any opinions or plans?

There is and has been a lot of discussion around this. The first
ticket to tackle this is actually pretty old:
https://issues.apache.org/jira/browse/CASSANDRA-47 (but you do know
about this ticket). There's also talk about completely rewriting the
file format (https://issues.apache.org/jira//browse/CASSANDRA-674).

I don't take a lot of risk saying that at least some form of
compression will happen some day. I wouldn't go as far as giving you
a date though.

--
Sylvain

> Sorry, I could not find any JIRAs on the topic, but I guess I am not
> surprised if they exist.
> Yes, I could serialize this myself outside of cassandra, but that would sort
> of defeat the purpose of using a more advanced storage system like
> cassandra.
> Regards,
> Terje