Subject: Re: Data overhead discussion in Cassandra
From: Sameer Farooqui
To: user@cassandra.apache.org
Date: Mon, 18 Jul 2011 10:53:44 -0700 (PDT)

Aaron,

That additional 15 bytes of overhead was the missing puzzle piece.

We had RF = 3.

So, now my calculations show that our CF should have a total of about 3.1 TB of data, and the actual figure is 3.3 TB (which might just be some stale tombstones).

Thanks for the clarification about what else the index file contains; it helps us justify the additional storage overhead.

- Sameer

On Sun, Jul 17, 2011 at 4:04 PM, aaron morton wrote:

> What RF are you using?
>
> On disk each column has 15 bytes of overhead, plus the column name and the
> column value. So for an 8 byte long and an 8 byte double there will be 16
> bytes of data and 15 bytes of overhead.
>
> The index file also contains the row key, the MD5 token (for RP) and the
> row offset for the data file.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 15 Jul 2011, at 07:09, Sameer Farooqui wrote:
>
> > We just set up a demo cluster with Cassandra 0.8.1 with 12 nodes and
> > loaded 1.5 TB of data into it. However, the actual space on disk being
> > used by data files in Cassandra is 3 TB. We're using a standard column
> > family with a million rows (key = string) and 35,040 columns per key.
> > The column name is a long and the column value is a double.
> >
> > I was just hoping to understand more about why the data overhead is so
> > large. We're not using expiring columns. Even considering indexing and
> > bloom filters, it shouldn't have bloated up the data size to 2x the
> > original amount. Or should it have?
> >
> > How can we better anticipate the actual data usage on disk in the future?
> >
> > - Sameer
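The numbers in the thread can be checked with a quick back-of-envelope calculation. This is only a rough sketch: the per-column constants and row/column counts are the ones quoted above, the helper name is made up, and it deliberately ignores row-level overhead, index files, bloom filters, and tombstones.

```python
# Constants taken from the thread; everything else here is illustrative.
COLUMN_OVERHEAD = 15   # bytes of on-disk overhead per column (per Aaron)
COLUMN_NAME = 8        # column name is a long
COLUMN_VALUE = 8       # column value is a double

def expected_disk_bytes(rows, cols_per_row, replication_factor):
    """Rough on-disk estimate: per-column bytes times column count times RF.
    Ignores row overhead, index files, bloom filters, and tombstones."""
    per_column = COLUMN_OVERHEAD + COLUMN_NAME + COLUMN_VALUE  # 31 bytes
    return rows * cols_per_row * per_column * replication_factor

# 1 million rows, 35,040 columns per row, RF = 3
total = expected_disk_bytes(1_000_000, 35_040, 3)
print(total / 1e12)  # ~3.26 TB
```

At RF = 3 this comes out to roughly 3.26 TB before the extra structures, consistent with the ~3.1 TB estimate and 3.3 TB actual figure discussed above.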