incubator-cassandra-user mailing list archives

From aaron morton
Subject Re: [howto measure disk usage]
Date Sun, 15 May 2011 22:29:32 GMT
Sub columns of a super column do serialise their timestamp; they are just the same as regular
columns. The super column does not have a timestamp of its own. It does have its own tombstone
marker though.

A super column does not take a huge amount more disk space, just the name, a short int, two ints
and a long int.
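
As a rough sketch of that arithmetic, taking the field list above at face value and assuming the usual Java widths (2 bytes for a short int, 4 for an int, 8 for a long int), the fixed cost per super column can be worked out as follows. The helper name and the example name are made up for illustration; nothing here is read from the Cassandra source.

```python
import struct

# Field widths are assumed to be the standard Java ones, not taken from
# the Cassandra serialisation code.
SHORT = struct.calcsize(">h")  # 2 bytes
INT = struct.calcsize(">i")    # 4 bytes
LONG = struct.calcsize(">q")   # 8 bytes


def super_column_overhead(name: bytes) -> int:
    """Bytes a super column adds on top of its sub columns (hypothetical helper)."""
    # name + one short int + two ints + one long int, as listed above
    return len(name) + SHORT + 2 * INT + LONG


if __name__ == "__main__":
    print(super_column_overhead(b"sc-name-0001"))
```

Under those assumptions the overhead is 18 bytes plus the length of the name, paid once per super column rather than once per sub column.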

Some things to consider:

- Were there any compacted files on disk? These are sstables that have one zero-length file
with COMPACTED in the name. These files will be deleted at some point.
- What did the commit log directory look like? Flushing should have checkpointed all the
log segments and deleted the log files.
- I'm assuming this was a single node; if not, was the node collecting Hinted Handoffs?
- Did the standard CF have cache saving enabled ?

Take a poke around the /var/lib/cassandra tree and let us know if you see anything interesting.
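
If it helps, here is a minimal sketch of that kind of poke around: it totals file sizes per top-level directory under the tree and lists zero-length files with "compacted" in the name. The function name is made up, the case-insensitive name match is an assumption about how the marker files are named, and the sizes are apparent file sizes, so they can differ a little from what du reports.

```python
import os
import sys


def summarize(root):
    """Return (bytes per top-level subdirectory, zero-length 'compacted' files)."""
    sizes = {}
    markers = []
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        top = "." if rel == "." else rel.split(os.sep)[0]
        for fname in filenames:
            path = os.path.join(dirpath, fname)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file disappeared while we were walking the tree
            sizes[top] = sizes.get(top, 0) + size
            # Assumption: marker files are empty and carry "compacted" in the name.
            if size == 0 and "compacted" in fname.lower():
                markers.append(path)
    return sizes, markers


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "/var/lib/cassandra"
    sizes, markers = summarize(root)
    for name, size in sorted(sizes.items()):
        print(f"{name:15s} {size / 2**20:10.1f} MB")
    print(f"{len(markers)} zero-length compacted marker file(s)")
    for path in markers:
        print("  " + path)
```

Run it with the data root as the only argument, once after the flush, and compare the per-directory totals between the standard and super column runs.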

Aaron Morton
Freelance Cassandra Developer

On 14 May 2011, at 03:15, Alexis Rodríguez wrote:

> cassandra-people,
> I'm trying to measure disk usage by cassandra after inserting some columns in order to
> plan disk sizes and configurations for future deploys.
> My approach is very straightforward:
> clean_data (stop_cassandra && rm -rf /var/lib/cassandra/{data,commitlog,saved_caches}/*)
> perform_inserts
> measure_disk_usage (nodetool flush && du -ch /var/lib/cassandra)
> There are two types of inserts:
> In a simple column with key, name and value a random string of size 100
> In a super-column with key, super-column-name, name and value a random string of size
> But surprisingly, when I insert 100 million columns as simple columns, it uses more
> disk than the same number in a super-column. How can that be possible?
> For simple column 41984 MB and for super-column 29696 MB, the difference is more than noticeable!
> Somebody told me yesterday that super-columns don't have a per-column timestamp, but...
> in my case, even if all the data were under the same super-column key, it would not explain the
> ps: sorry, English is not my first language
> <results.eps>
