cassandra-user mailing list archives

From Julie <julie.su...@nextcentury.com>
Subject Re: Cassandra disk space utilization WAY higher than I would expect
Date Fri, 23 Jul 2010 15:57:09 GMT
Jonathan Ellis <jbellis <at> gmail.com> writes:

> 
> then obsolete sstables is not your culprit.
> 

I believe I figured out how to force my node disk usage to go down.  I had been
letting Cassandra perform its own data management, and did not use nodetool to
force anything since in our real system, the data will need to be managed
automatically, without much human intervention.

But in my focused testing today I see that if I run nodetool "cleanup" on the
nodes taking up far more space than I expect, multiple SSTables get combined
into 1 or 2 and the live disk usage drops sharply, down to what I know the raw
data requires.

This is great news!  I haven't tested it on hugely bloated nodes yet (where the
disk usage is 6X the size of the raw data) since I haven't reproduced that
problem today, but I would think using nodetool "cleanup" will work.

I just have two questions:

       (1) How can I set up Cassandra to do this automatically, to allow my
nodes to store more data? 
 
       (2) I am a bit confused about why cleanup works this way, since the doc
claims it just cleans up keys no longer belonging to this node.  I have 8 nodes
and do a simple sequential write of 10,000 keys to each of them.  I'm using
the RandomPartitioner and give each node an InitialToken that should force even
spacing of tokens around the hash space:

# Create tokens for the RandomPartitioner that evenly divide the token space.
# The RandomPartitioner hashes keys into integer tokens in the range 0 to
# 2^127.
# So we simply divide that space into N equal sections.

for ((ii=1; ii<=serverCount; ii++)); do
    host=ec2-server$ii
    echo "Generating InitialToken for server on $host"
    # bc does arbitrary-precision integer math; bash arithmetic would overflow at 2^127
    token=$(echo "($ii*(2^127))/$serverCount" | bc)
    echo "host=$host initialToken=$token"
    echo "<InitialToken>$token</InitialToken>" >> storage-conf-node.xml
    cat storage-conf-node.xml
done

If the tokens really are evenly distributed, I wouldn't expect there to be many
keys to redistribute.  (All my rows are 1000Kb long, one column.)  So I'm not
sure why cleanup is having such a big effect on my disk space usage.

If you can tell me how to automate this and why it's working, I would love it.

Thanks for your help!
Julie





