From Robert Wille <>
Subject Re: Dropped mutation messages
Date Mon, 15 Jun 2015 17:14:45 GMT
My primary concern isn’t so much that I am dropping mutation messages, but the fact that
I don’t get errors when it happens. According to my understanding, a dropped mutation message
should always result in an error thrown back to the client when RF=1.

I looked at my logs to see what was going on when the dropped mutation messages occurred.
Two nodes in my cluster dropped a bunch of messages at roughly the same time, and a third
node dropped messages about two minutes later. No GC messages in the logs at that time on
any node. Nothing else of interest in the logs around that time.

The logs showed the dropped mutation messages to be much older than I assumed they would be.
I assumed they would be from when I was doing heavy migration, but they are from when I first
started using the cluster.

For unknown reasons, my writes spiked heavily at the time the messages were dropped:


My data migration tool is heavily threaded, and perhaps there was a bug in the code that limits
the concurrency, and I fixed it without realizing it. I really have no reasonable explanation
for that spike. Whatever the cause, the spike caused a lot of GC, but not enough to produce
any GC events in the logs. My guess is that there were so many writes that they simply took
too long and timed out. But again, it is very disturbing that the client never knew about it.

The good news is that I haven’t dropped mutation messages since I began migrating data in
earnest, and I’ve imported close to a billion records so far. I should probably crank up
the threading on my migration code to force errors on the cluster and see what happens. If
I can reproduce the dropped messages without getting a timeout in the client, then perhaps
I can file a jira.

Thanks to those that responded, and hopefully the information from my logs and OpsCenter will
help people that are following this thread, or that may stumble across it in the future.


On Jun 13, 2015, at 12:09 PM, Anuj Wadehra wrote:

You said RF=1; I missed that. I'm not sure eventual consistency is what is creating the issues.

Anuj Wadehra

From: "Anuj Wadehra"
Date: Sat, 13 Jun, 2015 at 11:31 pm
Subject: Re: Dropped mutation messages

I think the dropped messages are the asynchronous ones required to maintain eventual consistency.
The client may not be complaining because the data gets committed to one node synchronously, but
is dropped when sent to other nodes asynchronously.

We resolved a similar issue in our cluster by increasing memtable_flush_writers from 1 to 3
(we were writing to multiple CFs simultaneously).

We also fixed GC issues and reduced memtable_total_space_in_mb to ensure that most memtables
are flushed early under heavy write loads.

Anuj Wadehra

From: "Robert Wille"
Date: Sat, 13 Jun, 2015 at 8:29 pm
Subject: Re: Dropped mutation messages

Internode messages which are received by a node, but do not get to be processed within
rpc_timeout, are dropped rather than processed, as the coordinator node will no longer be waiting
for a response. If the coordinator node does not receive Consistency Level responses before
the rpc_timeout, it will return a TimedOutException to the client.
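
The rule quoted above can be pictured roughly as follows. This is only a toy model, not Cassandra's code: the names Mutation, replica_should_drop and coordinator_result are made up for illustration, and the timeout value is an assumption.

```python
from dataclasses import dataclass

# Assumed timeout, for illustration only; the real value comes from cassandra.yaml.
RPC_TIMEOUT_MS = 10_000


@dataclass
class Mutation:
    """Hypothetical stand-in for a queued internode write message."""
    enqueued_at_ms: int  # time at which the replica received the message


def replica_should_drop(m: Mutation, now_ms: int, timeout_ms: int = RPC_TIMEOUT_MS) -> bool:
    # A replica discards a message that has waited longer than the timeout,
    # since the coordinator has already stopped waiting for the reply.
    return now_ms - m.enqueued_at_ms > timeout_ms


def coordinator_result(acks_received: int, acks_required: int) -> str:
    # The coordinator reports a timeout when too few replicas answered in time.
    return "ok" if acks_received >= acks_required else "TimedOutException"
```

Under this model, a mutation dropped by the only replica leaves the coordinator with zero acknowledgements, so coordinator_result(0, 1) reports a timeout.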

I understand that, but that’s where this makes no sense. I’m running with RF=1, and CL=QUORUM,
which means each update goes to one node, and I need one response for a success. I have many
thousands of dropped mutation messages, but no TimedOutExceptions thrown back to the client.
If I have GC problems, or other issues that are making my cluster unresponsive, I can deal
with that. But having writes that fail and no error is clearly not acceptable. How is it possible
to be getting errors and not be informed about them?
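
For anyone following the arithmetic: a quorum is floor(RF / 2) + 1 replicas, so with RF=1 a QUORUM write needs exactly one acknowledgement. A minimal sketch (the function name quorum is my own, not a driver API):

```python
def quorum(replication_factor: int) -> int:
    """Number of replica acknowledgements needed for CL=QUORUM."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    # Integer division rounds down, giving floor(RF / 2) + 1.
    return replication_factor // 2 + 1


for rf in (1, 2, 3, 5):
    print(f"RF={rf} needs {quorum(rf)} ack(s)")
```

So at RF=1 there is no second replica that could satisfy the write while another one drops it.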



