cassandra-user mailing list archives

From Ben Boule <>
Subject State of Cassandra-Shuffle (1.2.x)
Date Mon, 17 Jun 2013 15:37:26 GMT
A bit of background:

We are in beta, with a very small (2 node) cluster that we created with 1.2.1. Being new
to this we did not enable vnodes, and we got bitten hard by the default token generation in
production after setting up lots of development & QA clusters without ever running into the
problem. We ended up with roughly 97.5% of the token range belonging to one of the two nodes.
The good news is that even one Cassandra node is handling our load OK right now. The bad news,
of course, is that we would still rather it be balanced. There is only about 120 GB of data.
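For the record, this is what we should have done when seeding the first cluster: with the 1.2 default Murmur3Partitioner (an assumption about the setup; RandomPartitioner uses a 0..2**127 range instead), balanced initial_token values are just even divisions of the signed 64-bit token range. A minimal sketch:

```python
# Evenly spaced initial_token values for Murmur3Partitioner,
# whose token range is [-2**63, 2**63 - 1].
def initial_tokens(node_count):
    return [i * (2**64 // node_count) - 2**63 for i in range(node_count)]

# For a 2-node cluster, each node then owns ~50% of the range:
print(initial_tokens(2))  # [-9223372036854775808, 0]
```

Each value would go into `initial_token` in that node's cassandra.yaml before first boot.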

We would like to upgrade this cluster to vnodes. We first tried doing this on 1.2.1; it did
not work due to the bug where the shuffle job inserted a corrupted row into the
system.range_xfers column family. Last week I talked to several people at the summit and it
was recommended that we try this with 1.2.5.

I have a test cluster I am trying to run this procedure on. I set it up with 1 token per
node, then upgraded it to vnodes, then upgraded it to 1.2.5 with no problems on Friday, and
let it run over the weekend. All appeared to be well when I left: something like 500 total
relocations had been generated, it had chugged through ~100 of them after an hour or so, and
it looked like it was heading towards being balanced.
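For completeness, the procedure I followed on the test cluster (command names as I understand them from the 1.2 docs; double-check against your install, since the script ships as bin/shuffle in the tarball and cassandra-shuffle in the packages):

```shell
# 1. On each node, set num_tokens in cassandra.yaml and do a rolling restart;
#    each node splits its single token into that many contiguous vnodes:
#      num_tokens: 256

# 2. Schedule the relocations (no data moves yet) and inspect the plan:
cassandra-shuffle create
cassandra-shuffle ls

# 3. Kick off the background shuffle:
cassandra-shuffle enable
```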

----@ip-10-10-1-160:/var/lib/cassandra/data/Keyspace1/Standard1/snapshots# nodetool status
Datacenter: datacenter1
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens  Owns   Host ID                               Rack
UN  1.02 GB    254     66.8%  6d500bc6-95fb-47a3-afb5-c283c4f3de03  rack1
UN  1.1 GB     258     33.2%  186d99b8-9fde-4e50-959a-6fba6098fba6  rack1

When I came in to work today (Monday), there were 189 relocations to go, and this is what
the status looks like.

----@ip-10-10-1-160:/var/lib/cassandra/data/Keyspace1/Standard1/snapshots# nodetool status
Datacenter: datacenter1
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens  Owns   Host ID                               Rack
UN  48.11 GB   231     38.7%  6d500bc6-95fb-47a3-afb5-c283c4f3de03  rack1
UN  34.5 GB    281     61.3%  186d99b8-9fde-4e50-959a-6fba6098fba6  rack1

An hour later and now it looks like this:

-----@ip-10-10-1-160:/tmp# nodetool status
Datacenter: datacenter1
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens  Owns   Host ID                               Rack
UN  11.61 GB   231     38.7%  6d500bc6-95fb-47a3-afb5-c283c4f3de03  rack1
UN  931.45 MB  281     61.3%  186d99b8-9fde-4e50-959a-6fba6098fba6  rack1

I did notice that it had fallen behind on compaction while this was running.

-----@ip-10-10-1-161:~$ nodetool compactionstats
pending tasks: 6
          compaction type        keyspace   column family       completed           total      unit   progress
               Compaction       Keyspace1       Standard1      1838641124      5133428315     bytes    35.82%
               Compaction       Keyspace1       Standard1      2255463423      5110283630     bytes    44.14%
Active compaction remaining time :   0h06m06s
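In case anyone wants to reproduce the observations above, I was polling with something like this (interval and data path are from my setup; adjust to taste):

```shell
# Poll relocation backlog, compaction backlog, and disk headroom every minute:
watch -n 60 '
  cassandra-shuffle ls | wc -l      # pending relocations
  nodetool compactionstats          # compaction backlog
  df -h /var/lib/cassandra          # disk headroom
'
```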

The reduction in disk space did seem to correspond with about half of the compaction jobs
finishing. Usage bounces up and down as the shuffle runs, consuming huge amounts of space and
then freeing it up.

My question is: what can we expect out of this job? Should it really be working? Do we need
to expect it to consume 70-100x the data size in disk space while it runs? Are there
compaction options we can set ahead of time to minimize the penalty here? What is the expected
extra space consumed while it runs, and what is the expected extra space consumed when it is
done? Note that in my test cluster I used a keyspace created by cassandra-stress, which uses
the default compaction settings (SizeTiered with whatever the default thresholds are). In our
real cluster we did configure compaction.
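On the compaction-options question, these are the two knobs I'm considering trying next (illustrative values, not something anyone at the summit recommended):

```shell
# Remove the compaction throughput cap (16 MB/s by default in 1.2) so
# compaction can keep pace with the relocations; 0 disables throttling:
nodetool setcompactionthroughput 0

# Lower SizeTiered's min_threshold (default 4) so the small SSTables the
# shuffle streams in get merged sooner; substitute the real table name:
echo "ALTER TABLE Keyspace1.Standard1
      WITH compaction = {'class': 'SizeTieredCompactionStrategy',
                         'min_threshold': 2};" | cqlsh
```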

Our original plan, when the job didn't work against 1.2.1, was to bring up a new cluster
alongside the old one, pre-configured for vnodes, and then migrate our data out of the old
cluster into the new one. Obviously this requires us to write our own software to do the
migration. We are going to size up the new cluster as well and update the schema, so it's not
a total waste, but we would have liked to be able to balance the load on the original cluster
in the meantime.
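For that fallback plan, a snapshot plus sstableloader might avoid custom migration code entirely, assuming the schema stays compatible (hostnames and paths below are placeholders from my test setup):

```shell
# Take a snapshot of the keyspace on the old cluster:
nodetool snapshot -t premigration Keyspace1

# sstableloader infers keyspace/table from the last two path components,
# so stage the snapshot under a matching directory first:
mkdir -p /tmp/load/Keyspace1/Standard1
cp /var/lib/cassandra/data/Keyspace1/Standard1/snapshots/premigration/* \
   /tmp/load/Keyspace1/Standard1/

# Stream into any node of the new, vnode-enabled cluster:
sstableloader -d new-cluster-node /tmp/load/Keyspace1/Standard1
```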

Any advice? We are planning to migrate to 2.0 later this summer but probably don't want to
build it from the beta source ourselves right now.

Thank you,
Ben Boule
This electronic message contains information which may be confidential or privileged. The
information is intended for the use of the individual or entity named above. If you are not
the intended recipient, be aware that any disclosure, copying, distribution or use of the
contents of this information is prohibited. If you have received this electronic transmission
in error, please notify us by e-mail immediately.
