cassandra-commits mailing list archives

From "Akhtar Hussain (JIRA)" <>
Subject [jira] [Reopened] (CASSANDRA-8352) Timeout Exception on Node Failure in Remote Data Center
Date Tue, 25 Nov 2014 11:17:12 GMT


Akhtar Hussain reopened CASSANDRA-8352:

Currently, it’s not possible for us to upgrade immediately to 2.0.11. Moreover,
we are not certain whether it’s an issue with the Cassandra version or a problem with our setup.

I would appreciate it if you could try to reproduce the issue on Cassandra 2.0.3. Moreover, we
would like you to recheck our configuration. We are using the private IP for rpc_address and the public
IP for seeds and listen_address. Is this configuration OK?

It’s very strange that, in spite of using LOCAL_QUORUM for reads, we are getting org.apache.cassandra.thrift.TimedOutException:
null in our application logs. We are also getting a read timeout exception in the Cassandra logs,
as only 5 out of 6 nodes responded when we killed one node. The exception in the Cassandra logs is acceptable
as long as we don’t get an exception in Thrift. Please analyse the stack trace we shared.
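
A minimal sketch of the arithmetic behind this expectation, assuming the usual quorum formula of floor(RF / 2) + 1. The class and method names are illustrative only and are not part of Cassandra, Thrift, or Hector:

```java
public class QuorumSketch {
    // Number of replicas that must respond for a quorum: floor(rf / 2) + 1.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    public static void main(String[] args) {
        int localRf = 3;   // replicas in DC1, per the keyspace definition
        int remoteRf = 3;  // replicas in DC2, per the keyspace definition

        // LOCAL_QUORUM only counts replicas in the coordinator's data center.
        int localQuorum = quorum(localRf);
        // QUORUM counts replicas across all data centers.
        int globalQuorum = quorum(localRf + remoteRf);

        System.out.println("LOCAL_QUORUM needs " + localQuorum + " replicas in the local DC");
        System.out.println("QUORUM needs " + globalQuorum + " replicas across both DCs");
    }
}
```

With 3 replicas per DC, LOCAL_QUORUM should need only 2 responses from DC1, so the loss of a single DC2 node should not by itself cause a read in DC1 to time out.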

Steps to Reproduce:
1.	Set up two DCs with 3 nodes each
2.	Cassandra.yaml:
a.	Seeds= public host names of 6 nodes (as configured in /etc/hosts)
b.	Listen_address= public host name of node
c.	Rpc_address= private host name as configured in /etc/hosts
d.	Using vnodes
default= DC1:RAC1 (for DC1 nodes) / default= DC2:RAC1 (for DC2 nodes)

host<n>_pub= public hostname
geo<n>_host= public hostname of nodes in remote DC

4.	Keyspace configuration
CREATE KEYSPACE vs WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'DC2': '3',
  'DC1': '3'
};

5.	Run traffic of 200 read requests/sec on DC1.

6.	Go to one node of DC2 and do kill -9 <cassandra pid>

7.	Read requests on DC1 fail temporarily.
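
As a possible client-side stopgap for the temporary failures in step 7, reads (which are idempotent) could be wrapped in a bounded retry. The following is only a sketch in plain Java under that assumption. It does not use Hector's API, and `RetrySketch` and `withRetry` are hypothetical names; a real client would retry only on timeout exceptions rather than on every exception:

```java
import java.util.concurrent.Callable;

public class RetrySketch {
    // Runs the operation, retrying up to maxAttempts times with a fixed pause
    // between attempts. Rethrows the last failure if every attempt fails.
    static <T> T withRetry(Callable<T> operation, int maxAttempts, long pauseMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                // Assumption: every exception is treated as retryable here.
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulated read that fails twice before succeeding.
        String result = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new RuntimeException("simulated timeout");
            }
            return "row";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

This would only mask the symptom; it does not explain why a LOCAL_QUORUM read in DC1 waits on a node in DC2.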

> Timeout Exception on Node Failure in Remote Data Center
> -------------------------------------------------------
>                 Key: CASSANDRA-8352
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Unix, Cassandra 2.0.3
>            Reporter: Akhtar Hussain
>              Labels: DataCenter, GEO-Red
> We have a Geo-red setup with 2 data centers having 3 nodes each. When we bring a
single Cassandra node down in DC2 by kill -9 <Cassandra-pid>, reads fail on DC1 with
TimedOutException for a brief amount of time (~15-20 sec).
> Questions:
> 1.	We need to understand why reads fail on DC1 when a node in another DC, i.e. DC2, fails.
As we are using LOCAL_QUORUM for both reads and writes in DC1, a request should return once 2 nodes
in the local DC have replied, instead of timing out because of a node in the remote DC.
> 2.	We want to make sure that no Cassandra requests fail in case of node failures. We
used rapid read protection of ALWAYS/99percentile/10ms as mentioned in
But nothing worked. How can we ensure zero request failures when a node fails?
> 3.	What is the right way to handle HTimedOutException in Hector?
> 4.	Please confirm whether we are using public and private hostnames as expected.
> We are using Cassandra 2.0.3.

This message was sent by Atlassian JIRA
