incubator-cassandra-user mailing list archives

From Sameer Farooqui <cassandral...@gmail.com>
Subject Re: Data overhead discussion in Cassandra
Date Mon, 18 Jul 2011 17:53:44 GMT
Aaron,

That additional 15 bytes of overhead was the missing puzzle piece.

We had RF = 3.

So, now my calculations show that our CF should have a total of about 3.1 TB
of data and the actual figure is 3.3 TB (which might just be some stale
tombstones).
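
A rough sketch of this kind of estimate, using only the figures quoted in this thread (a million rows, 35,040 columns per row, an 8-byte long name, an 8-byte double value, 15 bytes of per-column overhead, RF = 3). The function and constant names are hypothetical, and the sketch deliberately ignores row keys, index files and bloom filters, so it is not necessarily the exact calculation behind the 3.1 TB figure:

```python
# Per-column overhead on disk, as quoted in this thread (Cassandra 0.8.x).
COLUMN_OVERHEAD_BYTES = 15


def estimate_bytes(rows, columns_per_row, name_bytes, value_bytes,
                   replication_factor):
    """Estimate total column data on disk across all replicas.

    Assumption: row keys, index entries and bloom filters are not counted.
    """
    per_column = COLUMN_OVERHEAD_BYTES + name_bytes + value_bytes
    per_replica = rows * columns_per_row * per_column
    return per_replica * replication_factor


# Figures from this thread: 1M rows, 35,040 columns, long name, double value.
total = estimate_bytes(1_000_000, 35_040, 8, 8, 3)
print(f"{total / 1e12:.2f} TB (decimal) = {total / 2**40:.2f} TiB (binary)")
```

Whether the result is read in decimal terabytes or binary tebibytes shifts the number by roughly 10%, which is worth keeping in mind when comparing against what the filesystem reports.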

Thanks for the clarification about what else the index file contains; it
helps us justify the additional storage overhead.

- Sameer



On Sun, Jul 17, 2011 at 4:04 PM, aaron morton <aaron@thelastpickle.com> wrote:

> What RF are you using ?
>
> On disk each column has 15 bytes of overhead, plus the column name and the
> column value. So for an 8-byte long and an 8-byte double there will be 16
> bytes of data and 15 bytes of overhead.
>
> The index file also contains the row key, the MD5 token (for RP) and
> the row offset for the data file.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 15 Jul 2011, at 07:09, Sameer Farooqui wrote:
>
> > We just set up a demo cluster with Cassandra 0.8.1 with 12 nodes and
> > loaded 1.5 TB of data into it. However, the actual space on disk being used
> > by data files in Cassandra is 3 TB. We're using a standard column family
> > with a million rows (key=string) and 35,040 columns per key. The column name
> > is a long and the column value is a double.
> >
> > I was just hoping to understand more about why the data overhead is so
> > large. We're not using expiring columns. Even considering indexing and bloom
> > filters, it shouldn't have bloated up the data size to 2x the original
> > amount. Or should it have?
> >
> > How can we better anticipate the actual data usage on disk in the future?
> >
> > - Sameer
>
>
