incubator-cassandra-user mailing list archives

From Lior Golan
Subject RE: Damaged commit log disk causes Cassandra client to get stuck
Date Sun, 31 Jul 2011 21:40:01 GMT
Thanks Aaron. We will try to pull the logs and post them in this forum.

But what I don't understand is why the client should pause at all. We are writing with CL.ONE,
and the replication factor is 2. As far as we understand, the client talks to the StorageProxy
on some node (any node, for that matter), which then sends write requests to both replicas
but waits for only the first of them to acknowledge the write.

So even if one node got stuck because of this commit log disk failure, it should not have
blocked the client. Can you explain why that happened in the first place?
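
To make that expectation concrete, here is a toy simulation of it. This is not Cassandra
code: write_at_cl_one, replica_write and the node names are hypothetical, and the sketch
assumes the coordinator itself is healthy. It fans a write out to both replicas and returns
as soon as the first one acknowledges, or raises once rpc_timeout expires.

```python
import concurrent.futures as cf
import threading


def replica_write(name, stall):
    """Pretend to apply a write; a stalled replica blocks until released."""
    if stall is not None:
        stall.wait()  # stands in for a hung commit log disk
    return name


def write_at_cl_one(replicas, rpc_timeout):
    """Send the write to every replica and return the first one to acknowledge."""
    pool = cf.ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(replica_write, name, stall) for name, stall in replicas]
    try:
        done, _ = cf.wait(futures, timeout=rpc_timeout,
                          return_when=cf.FIRST_COMPLETED)
        if not done:
            raise TimeoutError("no replica acknowledged within rpc_timeout")
        return next(iter(done)).result()
    finally:
        pool.shutdown(wait=False)  # do not wait for the stalled replica


if __name__ == "__main__":
    hung_disk = threading.Event()
    print(write_at_cl_one([("node_a", None), ("node_b", hung_disk)], rpc_timeout=1.0))
    hung_disk.set()  # release the simulated hung replica so the process can exit
```

In this model the stalled replica never delays the caller, which is the behaviour we
expected from the real cluster.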

And to add to that: when we took down the Cassandra node with the faulty commit log disk,
the client continued to write and didn't seem to be affected (which is what we expected to
happen in the first place, but didn't).

From: aaron morton
Sent: Monday, August 01, 2011 12:19 AM
Subject: Re: Damaged commit log disk causes Cassandra client to get stuck

A couple of timeouts should have kicked in.

First, the rpc_timeout on the server side should have kicked in and given the client a (thrift)
TimedOutException. Second, a client-side socket timeout should be set so that the client
times out the socket. Did either of these appear in the client-side logs?
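
As a rough illustration of the second point, the sketch below shows a client-side socket
timeout in isolation. It is plain Python sockets, not Hector or Thrift; silent_server,
request_with_timeout and demo are hypothetical names, and the "server" is a local stand-in
that accepts a connection and never answers.

```python
import socket
import threading


def silent_server(listener, release):
    """Accept one connection and never reply, like a node stuck on I/O."""
    conn, _ = listener.accept()
    release.wait()  # hold the connection open without answering
    conn.close()


def request_with_timeout(port, timeout_s):
    """Send a request; return None if no reply arrives within timeout_s."""
    with socket.create_connection(("127.0.0.1", port), timeout=timeout_s) as sock:
        sock.sendall(b"write")
        try:
            return sock.recv(1024)
        except socket.timeout:
            return None  # the caller can now drop this connection and try another node


def demo(timeout_s=0.2):
    """Run one request against the silent server and report the outcome."""
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    listener.listen(1)
    release = threading.Event()
    server = threading.Thread(target=silent_server, args=(listener, release),
                              daemon=True)
    server.start()
    try:
        return request_with_timeout(listener.getsockname()[1], timeout_s)
    finally:
        release.set()
        server.join(timeout=1)
        listener.close()
```

Without the timeout, the recv call would block for as long as the server stays silent,
which from the outside looks like a stuck client.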

In response to either of those my guess would be that Hector would cycle the connection. (I've
not checked this.)

How did the disk fail? Was there anything in the server logs?

Some background about handling disk fails


Aaron Morton
Freelance Cassandra Developer

On 1 Aug 2011, at 08:13, Lior Golan wrote:

In one of our test clusters we had a damaged commit log disk in one of the nodes.

We have replication factor = 2 in this cluster and write with consistency level = ONE, so
we expected that writes would not be affected by such an issue. But what actually happened is
that the client that was writing with CL.ONE got stuck. The client could resume writing once we
stopped the server with the faulty disk (another indication that this is not a replication
factor or consistency level issue).

We are running Cassandra 0.7.6, and the client we're using is Hector.

Can anyone explain what happened here? Why did the client get stuck when the commit log disk on
one of the servers was damaged (and why could it resume writing once we took that server down)?
