cassandra-user mailing list archives

From Robert Wille <>
Subject Observations/concerns with repair and hinted handoff
Date Tue, 09 Dec 2014 17:41:11 GMT
I have spent a lot of time working with single-node, RF=1 clusters in my development. Before
I deploy a cluster to our live environment, I have spent some time learning how to work with
a multi-node cluster with RF=3. There were some surprises. I’m wondering if people here
can enlighten me. I don’t exactly have that warm, fuzzy feeling.

I created a three-node cluster with RF=3. I then wrote to the cluster pretty heavily to cause
some dropped mutation messages. The dropped messages didn’t trickle in, but came in a burst.
I suspect full GC is the culprit, but I don’t really know. Anyway, I ended up with 17197
dropped mutation messages on node 1, 6422 on node 2, and none on node 3. In order to learn
about repair, I waited for compaction to finish doing its thing, recorded the size and estimated
number of keys for each table, started up repair (nodetool repair <keyspace>) on all
three nodes, and waited for it to complete before doing anything else (even reads). When repair
and compaction were done, I checked the size and estimated number of keys for each table.
All tables on all nodes grew in size and estimated number of keys. The estimated number of
keys for each node grew by 65k, 272k and 247k (.2%, .7% and .6%) for nodes 1, 2 and 3 respectively.
I expected some growth, but that’s significantly more new keys than I had dropped mutation
messages. I also expected the most new data on node 1 and none on node 3, neither of which
came close to what actually happened. Perhaps a mutation message contains more than one record?
Perhaps the dropped mutation message counter is incremented on the coordinator, not the node
that was overloaded?
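One possible explanation for the extra keys (an assumption on my part, based on my reading of how repair works): repair compares Merkle-tree hashes of partition ranges, not individual rows, so a single out-of-sync row causes the whole leaf range around it to be streamed. A quick back-of-the-envelope, assuming the tree has 2^15 leaves (my understanding of the hard-coded depth in 2.x) and evenly spread partitions:

```python
# Back-of-the-envelope for repair over-streaming. Assumptions: the Merkle
# tree used for validation has 2**15 leaves (my understanding of the 2.x
# hard-coded depth), and partitions are spread evenly across the leaves.

PARTITIONS = 40_000_000   # "almost 40 million partitions" in my 6 GB
LEAVES = 2 ** 15          # assumed leaf count per validation tree

partitions_per_leaf = PARTITIONS // LEAVES
print(partitions_per_leaf)  # -> 1220

# One dropped mutation dirties one leaf, but repair streams the whole
# leaf range, so each missed write can pull in ~1200 bystander partitions.
# That alone could make key growth far exceed the dropped-message counts.
```

If that's right, the per-node key growth would depend more on how many distinct leaf ranges were dirtied than on the raw dropped-mutation counts, which would also explain why the growth didn't line up with the counters node by node.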

I repeated repair, and the second time around the tables remained unchanged, as expected.
I would hope that repair wouldn’t do anything to the tables if they were in sync. 

Just to be clear, I’m not overly concerned about the unexpected increase in the number of keys.
I’m pretty sure that repair did the needful and brought the nodes into sync. The unexpected
results more likely indicate that I’m ignorant, and it really bothers me when I don’t
understand something. If you have any insights, I’d appreciate them.

One of the dismaying things about repair was that the first time around it took about 4 hours,
with a completely idle cluster (except for repairs, of course), and only 6 GB of data on each
node. I can bootstrap a node with 6 GB of data in a couple of minutes. That makes repair something
like 50 to 100 times more expensive than bootstrapping. I know I should run repair on one
node at a time, but even if you divide by three, that’s still a horrifically long time for
such a small amount of data. The second time around, repair only took 30 minutes. That’s
much better, but best-case is still about 10x longer than bootstrapping. Should repair really
be taking this long? When I have 300 GB of data, is a best-case repair going to take 25 hours,
and a repair with a modest amount of work more than 100 hours? My records are quite small.
Those 6 GB contain almost 40 million partitions. 
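For reference, here's the extrapolation behind those numbers. It assumes repair cost scales linearly with data volume, which is itself a guess:

```python
# Linear extrapolation of repair time from 6 GB to 300 GB. Assumption:
# repair cost (validation + streaming) scales linearly with data size.

data_now_gb = 6.0
data_future_gb = 300.0
scale = data_future_gb / data_now_gb   # 50x more data

best_case_hours = 30 / 60 * scale      # the 30-minute no-op repair
first_run_hours = 4 * scale            # the 4-hour repair with real work

print(best_case_hours)   # -> 25.0
print(first_run_hours)   # -> 200.0
```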

Following my repair experiment, I added a fourth node, and then tried killing a node and importing
a bunch of data while the node was down. As far as repair is concerned, this seems to work
fine (although again, glacially). However, I noticed that hinted handoff doesn’t seem to
be working. I added several million records (with consistency=one), and nothing appeared in
system.hints (du -hs showed a few dozen K bytes), nor did I get any pending Hinted Handoff
tasks in the Thread Pool Stats. When I started up the down node (less than 3 hours later),
the missed data didn’t appear to get sent to it. The tables did not grow, no compaction events
were scheduled, and there wasn’t any appreciable CPU utilization in the cluster. With
millions of records missed while the node was down, I should have noticed something if
the hints were actually being replayed. Is there some magic setting to turn on hinted handoff?
Were there too many hints and so it just deleted them? My assumption is that if hinted handoff
is working, then my need for repair should be much less, which given my experience so far,
would be a really good thing.
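For what it's worth, these are the hint-related settings I know of in cassandra.yaml (defaults as I understand them for 2.x; worth double-checking against your version). One coincidence that caught my eye: the default hint window is 10800000 ms, i.e. exactly 3 hours, and hints stop being recorded for a node that has been down longer than that, so my "less than 3 hours" outage was right at the edge of the window:

```yaml
# cassandra.yaml -- hint-related settings (2.x defaults as I understand
# them; verify against your version's packaged cassandra.yaml)
hinted_handoff_enabled: true
# Hints are no longer recorded for a node once it has been down longer
# than this window (10800000 ms = 3 hours)
max_hint_window_in_ms: 10800000
# Throttle for hint replay, per delivery thread
hinted_handoff_throttle_in_kb: 1024
max_hints_delivery_threads: 2
```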

Given the horrifically long time it takes to repair a node, and hinted handoff apparently
not working, if a node goes down, is it better to bootstrap a new one than to repair the node
that went down? I would expect that even if I chose to bootstrap a new node, it would need
to be repaired anyway, since it would probably miss writes while bootstrapping.

Thanks in advance

