From: aaron morton
To: user@cassandra.apache.org
Subject: Re: [howto measure disk usage]
Date: Mon, 16 May 2011 10:29:32 +1200
Message-Id: <4515F3DC-0A43-441F-8883-938BE01F79B5@thelastpickle.com>

Sub columns of a super column do serialise their timestamp; they are just the same as regular columns. The super column does not have a timestamp of its own. It does have its own tombstone marker though.

A super column does not take a huge amount more disk space: just the name, a short int, two ints and a long int.

Some things to consider:

- Were there any compacted files on disk? These are sstables that have one zero-length file with COMPACTED in the name. These files will be deleted at some point.
- What did the commit log directory look like? Flushing should have checkpointed all the log segments and deleted the log files.
- I'm assuming this was a single node; if not, was the node collecting Hinted Handoffs?
- Did the standard CF have cache saving enabled?

Take a poke around the /var/lib/cassandra tree and let us know if you see anything interesting.

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 14 May 2011, at 03:15, Alexis Rodríguez wrote:

> cassandra-people,
>
> I'm trying to measure disk usage by cassandra after inserting some columns in order to plan disk sizes and configurations for future deploys.
>
> My approach is very straightforward:
>
> clean_data (stop_cassandra && rm -rf /var/lib/cassandra/{data,commitlog,saved_caches}/*)
> perform_inserts
> measure_disk_usage (nodetool flush && du -ch /var/lib/cassandra)
>
> There are two types of inserts:
> - In a simple column with key, name and value a random string of size 100
> - In a super-column with key, super-column-name, name and value a random string of size 100
>
> But surprisingly, when I insert 100 million columns as simple columns it uses more disk than the same amount in a super-column. How can that be possible?
>
> For simple column 41984 MB and for super-column 29696 MB, the difference is more than noticeable!
>
> Somebody told me yesterday that super-columns don't have a per-column timestamp, but in my case, even if all the data was in the same super-column-key, that would not explain the difference!
>
> ps: sorry, English is not my first language
>

> <results.eps>
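
For anyone following the checklist in this thread, here is a minimal sketch (not part of the original messages) of how the /var/lib/cassandra tree could be inspected. It assumes the layout mentioned in the thread (data, commitlog and saved_caches under one base directory) and that compaction markers are zero-length files with "compacted" somewhere in the file name, matched case-insensitively. The function names `dir_size`, `audit` and `bytes_per_column` are made up for illustration, and `bytes_per_column` assumes the reported MB figures are multiples of 1024 * 1024 bytes.

```python
import os
from pathlib import Path


def dir_size(path: Path) -> int:
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = Path(root) / name
            # Skip symlinks so linked files are not counted twice.
            if fp.is_file() and not fp.is_symlink():
                total += fp.stat().st_size
    return total


def audit(base: Path) -> dict:
    """Report bytes per top-level subdirectory and list compaction markers.

    Assumption: a marker is a zero-length file whose name contains
    "compacted" (case-insensitive), as described in the thread.
    """
    sizes = {}
    for child in sorted(base.iterdir()):
        if child.is_dir():
            sizes[child.name] = dir_size(child)
    markers = [
        str(p.relative_to(base))
        for p in sorted(base.rglob("*"))
        if p.is_file()
        and "compacted" in p.name.lower()
        and p.stat().st_size == 0
    ]
    return {"sizes": sizes, "compacted_markers": markers}


def bytes_per_column(total_mib: float, columns: int) -> float:
    """Average on-disk bytes per column, assuming 1 MB = 1024 * 1024 bytes."""
    return total_mib * 1024 * 1024 / columns
```

Calling `audit(Path("/var/lib/cassandra"))` after a flush shows whether commitlog or saved_caches account for part of the total, and whether sstables marked as compacted are still sitting in the data directory. With the figures quoted above, `bytes_per_column(41984, 100_000_000)` comes to roughly 440 bytes per simple column and `bytes_per_column(29696, 100_000_000)` to roughly 311 bytes per sub column, which is the gap the questions in the reply are trying to explain.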