cassandra-user mailing list archives

From Jonathan Ellis <>
Subject Re: Cassandra disk space utilization WAY higher than I would expect
Date Sat, 31 Jul 2010 03:29:46 GMT
Did you try compact instead of cleanup, anyway?

On Tue, Jul 27, 2010 at 1:08 PM, Julie <> wrote:
> Peter Schuller <peter.schuller <at>> writes:
>> > a) cleanup is a superset of compaction, so if you've been doing
>> > overwrites at all then it will reduce space used for that reason
> Hi Peter and Jonathan,
> In my test, I write 80,000 rows (100KB each) to an 8-node cluster.  The
> 80,000 rows all have unique keys '1' through '80000', so no overwriting is
> occurring.  I also don't do any deletes.  I simply write the 80,000 rows to
> the 8-node cluster, which should be about 1GB of data times 3 (replication
> factor=3) on each node.
> The only special thing I am doing is that I use the RandomPartitioner and
> set the InitialToken on each node to try to get the data evenly distributed:
>    # Create tokens for the RandomPartitioner that evenly divide token space
>    # The RandomPartitioner hashes keys into integer tokens in the range 0 to
>    # 2^127.
>    # So we simply divide that space into N equal sections.
>    # serverCount = the number of Cassandra nodes in the cluster
>    for ((ii=1; ii<=serverCount; ii++)); do
>        host=ec2-server$ii
>        echo "Generating InitialToken for server on $host"
>        # bc does the arbitrary-precision math; 2^127 overflows shell integers
>        token=$(echo "($ii*(2^127))/$serverCount" | bc)
>        echo "host=$host initialToken=$token"
>        echo "<InitialToken>$token</InitialToken>" >> storage-conf-node.xml
>        cat storage-conf-node.xml
>    ...
>    done
> 24 hours after my writes, the data is evenly distributed according to
> cfstats (I see almost identical numRows from node to node) but there is
> a lot of extra disk space being used on some nodes, again according to
> cfstats.  This disk usage drops back down to 2.7GB (exactly what I expect
> since that's how much raw data is on each node) when I run "nodetool
> cleanup".
> I am confused about why there is anything to clean up 24 hours after my
> last write. All nodes in the cluster are fully up and aware of each other
> before I begin the writes.  The only other thing that could possibly be
> considered unusual is that I cycle through all 8 nodes, rather than
> communicating with a single Cassandra node.  I use a write consistency
> setting of ALL.  I can't see how these would increase the amount of disk
> space used, but I'm mentioning them just in case.
> Any help would be greatly appreciated,
> Julie
> Peter Schuller <peter.schuller <at>> writes:
>> > a) cleanup is a superset of compaction, so if you've been doing
>> > overwrites at all then it will reduce space used for that reason
>> I had failed to consider over-writes as a possible culprit (since
>> removals were stated not to be done). However thinking about it I
>> believe the effect of this should be limited to roughly a doubling of
>> disk space in the absolute worst case of over-writing all data in the
>> absolute worst possible order (such as writing everything twice in the
>> same order).
>> Or more accurately, it should be limited to wasting as much space
>> as the size of the overwritten values. If you're overwriting with
>> larger values, it will no longer be a "doubling" relative to the
>> actual live data set.
>> Julie, did you do over-writes or were your disk space measurements
>> based on the state of the cluster after an initial set of writes of
>> unique values?
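
As a rough check of the sizes discussed above, the expected raw data per node can be estimated with integer arithmetic. This is only a sketch under stated assumptions: 1KB is taken as 1024 bytes, the data is assumed to be perfectly evenly distributed, and on-disk overhead (indexes, bloom filters, column metadata) is ignored. The function name expected_bytes_per_node is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: expected raw data per node, in bytes.
# Assumes 1 KB = 1024 bytes and even distribution; ignores SSTable overhead.
set -euo pipefail

expected_bytes_per_node() {
    local rows=$1 row_kb=$2 replication=$3 nodes=$4
    # total bytes written, times the replication factor, spread over the nodes
    echo $(( rows * row_kb * 1024 * replication / nodes ))
}

per_node=$(expected_bytes_per_node 80000 100 3 8)
echo "expected per node: $per_node bytes"
# Worst case described above: every value overwritten once before compaction.
echo "worst case with full overwrite: $(( per_node * 2 )) bytes"
```

For the figures in the thread (80,000 rows of 100KB, replication factor 3, 8 nodes) this comes out at roughly 3GB per node, the same ballpark as the 2.7GB reported after cleanup.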

Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
