cassandra-user mailing list archives

From: Jim Ancona
Subject: Re: Damaged commit log disk causes Cassandra client to get stuck
Date: Tue, 02 Aug 2011 16:26:00 GMT
Sorry to follow up on my own post, but I just saw this issue: linked in a
neighboring thread (cassandra server disk full). It certainly implies
that a disk IO failure resulting in a "zombie" node is a possibility.


On Tue, Aug 2, 2011 at 4:19 PM, Jim Ancona wrote:
> Ideally, I would hope that a bad disk wouldn't hang a node but would
> instead just cause writes to fail, but if that is not the case,
> perhaps the bad disk somehow wedged that server node completely so
> that requests were not being processed at all (maybe not even being
> timed out). At that point you'd be depending on Hector's
> CassandraHostConfigurator.cassandraThriftSocketTimeout to expire,
> which would cause the request to fail over to a working node. But that
> value defaults to zero (i.e. forever), so if you didn't explicitly
> configure it your client would hang along with the server node.
> Perhaps someone with more knowledge of Cassandra's internals could
> comment on the possibility of the server hanging completely. I would
> think that the logs from the bad node might help to diagnose that.
> Jim
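
The client-side socket timeout point above can be illustrated with a minimal sketch. It uses plain java.net sockets, not Hector or Thrift, and the class and method names (SocketTimeoutSketch, readTimesOut) are hypothetical. A listener that never answers stands in for the wedged node. With a positive SO_TIMEOUT the blocked read fails quickly; a value of 0 would block indefinitely, which matches the "zero means forever" default described for cassandraThriftSocketTimeout.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class SocketTimeoutSketch {
    /** Returns true if the read timed out, false if data or EOF arrived first. */
    static boolean readTimesOut(int port, int timeoutMillis) throws IOException {
        try (Socket client = new Socket()) {
            // The connect succeeds even though the peer never calls accept(),
            // because the OS completes the handshake from the listen backlog.
            client.connect(new InetSocketAddress("127.0.0.1", port), 1000);
            // A timeout of 0 would make the read below block forever.
            client.setSoTimeout(timeoutMillis);
            try {
                client.getInputStream().read();
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a wedged node: it listens but never accepts or replies.
        try (ServerSocket zombie = new ServerSocket(0)) {
            System.out.println("timed out: " + readTimesOut(zombie.getLocalPort(), 200));
        }
    }
}
```

Once the read fails, a client is free to retry against another host. Whether and how Hector does so is not shown here.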
> On Sun, Jul 31, 2011 at 4:58 PM, aaron morton wrote:
>> Yup, it sounds like things may not have failed as they should. Do you have
>> a better definition of stuck? Was the client waiting for a single request
>> to complete, or was the client not cycling to another node?
>> If there are some server log details, they may help us understand what
>> happened. Also, what setting did you have for commitlog_sync in the yaml?
>> Some info on the failure would help too: did the disk stop dead, run
>> slowly, or fail intermittently?
>> AFAIK the wait for the writes to return should have timed out on the
>> coordinator. I may be behind on the expected behaviour; perhaps a thread
>> pool was shut down as part of handling the error, and this prevented the
>> error from being returned.
>> I would check the rpc_timeout in the yaml, and that the client is setting a
>> client-side socket timeout. Timeouts should kick in. Then check the
>> expected behaviour for Hector when it gets a timeout.
>> Cheers
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> On 1 Aug 2011, at 09:40, Lior Golan wrote:
>> Thanks Aaron. We will try to pull the logs and post them in this forum.
>> But what I don't understand is why the client should pause at all. We are
>> writing with CL.ONE, and the replication factor is 2. As far as we
>> understand, the client communicates with the StorageProxy of some node (any
>> node, for that matter), which then sends write requests to both replicas but
>> waits for just the first of them to acknowledge the write.
>> So even if one node got stuck because of this commit log disk failure, it
>> should not have stuck the client. Can you explain why that ever happened in
>> the first place?
>> And to add to that: when we took down the Cassandra node with the faulty
>> commit log disk, the client continued to write and didn't seem to be
>> bothered (which is what we expected to happen in the first place, but it
>> didn't).
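
The write path described above (send to both replicas, return on the first acknowledgement, give up after a timeout) can be sketched as follows. This is a simulation under stated assumptions, not Cassandra's StorageProxy code: FirstAckSketch and awaitFirstAck are hypothetical names, each replica is modelled as a CompletableFuture, and the timeout argument plays the role that rpc_timeout plays in the thread.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class FirstAckSketch {
    /** Waits until any one replica acknowledges, or throws after timeoutMillis. */
    static String awaitFirstAck(List<CompletableFuture<String>> replicaAcks, long timeoutMillis)
            throws InterruptedException, ExecutionException, TimeoutException {
        // anyOf completes as soon as the first of the given futures completes.
        CompletableFuture<Object> first =
                CompletableFuture.anyOf(replicaAcks.toArray(new CompletableFuture[0]));
        return (String) first.get(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<String> healthy = CompletableFuture.completedFuture("replica-A");
        CompletableFuture<String> wedged = new CompletableFuture<>(); // never completes

        // One healthy replica is enough for a write at consistency level ONE.
        System.out.println(awaitFirstAck(Arrays.asList(healthy, wedged), 500));

        // If every replica the wait depends on is wedged, only the timeout ends it.
        try {
            awaitFirstAck(Arrays.asList(wedged), 100);
        } catch (TimeoutException e) {
            System.out.println("timed out");
        }
    }
}
```

In this model a single wedged replica does not stall the write. The sketch does not cover the case where the node the client is connected to is itself wedged, which is the scenario where only a client-side socket timeout would help.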
>> From: aaron morton
>> Sent: Monday, August 01, 2011 12:19 AM
>> To:
>> Subject: Re: Damaged commit log disk causes Cassandra client to get stuck
>> A couple of timeouts should have kicked in.
>> First, the rpc_timeout on the server side should have kicked in and given
>> the client a (thrift) TimedOutException. Second, a client-side socket
>> timeout should be set so that the client will time out the socket. Did
>> either of these appear in the client-side logs?
>> In response to either of those my guess would be that hector would cycle the
>> connection. (I've not checked this.)
>> How did the disk fail? Was there anything in the server logs?
>> Some background about handling disk fails
>> Cheers
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> On 1 Aug 2011, at 08:13, Lior Golan wrote:
>> In one of our test clusters we had a damaged commit log disk in one of the
>> nodes.
>> We have replication factor = 2 in this cluster, and write with consistency
>> level = ONE. So we expected that writes would not be affected by such an
>> issue.
>> But what actually happened is that the client that was writing with CL.ONE
>> got stuck. The client could resume writing when we stopped the server with
>> the faulty disk (so this is another indication it's not a replication factor
>> or consistency level issue).
>> We are running Cassandra 0.7.6, and the client we're using is Hector.
>> Can anyone explain what happened here? Why did the client get stuck when
>> the commit log disk on one of the servers was damaged (and why could it
>> resume writing once we actually took that server down)?
