incubator-cassandra-user mailing list archives

From Paulo Motta
Subject Increased read timeouts during rolling upgrade to C* 1.2
Date Fri, 04 Oct 2013 16:09:50 GMT

I have isolated one of our data centers to simulate a rolling restart
upgrade from C* 1.1.10 to 1.2.10. We replayed our production traffic
against the C* nodes and observed an increased number of read timeouts
while the upgrade was in progress.

I executed nodetool drain before upgrading each node, and during the
upgrade "nodetool ring" was showing that node as DOWN, as expected. After
each upgrade all nodes were showing the upgraded node as UP, so apparently
all nodes were communicating fine.

I manually tried to insert data into, and retrieve data from, both the
newly upgraded nodes and the old nodes, and the behavior was very unstable:
sometimes it worked, sometimes it didn't (TimedOutException), so I don't
think it was a network problem.

The number of read timeouts diminished as the number of upgraded nodes
increased, until it reached stability. The logs were showing the following
messages periodically:

 INFO [HANDSHAKE-/10.176.249.XX] 2013-10-03 17:36:16,948 (line 399) Handshaking version with
 INFO [HANDSHAKE-/10.176.182.YY] 2013-10-03 17:36:17,280 (line 408) Cannot handshake version with
 INFO [HANDSHAKE-/10.176.182.YY] 2013-10-03 17:36:17,280 (line 399) Handshaking version with
 INFO [HANDSHAKE-/10.188.13.ZZ] 2013-10-03 17:36:17,510 (line 408) Cannot handshake version with
 INFO [HANDSHAKE-/10.188.13.ZZ] 2013-10-03 17:36:17,511 (line 399) Handshaking version with /10.188.13.ZZ
DEBUG [WRITE-/54.215.70.YY] 2013-10-03 18:01:50,237 (line 338) Target max version is -2147483648; no version information yet, will retry
TRACE [HANDSHAKE-/10.177.14.XX] 2013-10-03 18:01:50,237 (line 406) Cannot handshake version with
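
To see which peers account for the failed handshakes, the messages above could be tallied per peer. The following is only a minimal sketch, assuming the log line layout shown in the excerpt; the function name `count_failed_handshakes` and the `sample` list are hypothetical and not part of Cassandra. The peer address is read from the thread name because the message text in the excerpt is cut off.

```python
import re
from collections import Counter

# Matches the thread name of a handshake log line, e.g.
# "[HANDSHAKE-/10.176.182.YY]". Assumption: the layout is the one
# shown in the excerpt above (level, thread name, timestamp, message).
HANDSHAKE_RE = re.compile(r"^\s*[A-Z]+\s+\[HANDSHAKE-/(?P<peer>[^\]]+)\]")

FAILURE_TEXT = "Cannot handshake version with"


def count_failed_handshakes(lines):
    """Return a Counter mapping peer address to failed handshake count."""
    failures = Counter()
    for line in lines:
        match = HANDSHAKE_RE.match(line)
        if match and FAILURE_TEXT in line:
            failures[match.group("peer")] += 1
    return failures


# Lines copied from the excerpt above (addresses are masked in the source).
sample = [
    " INFO [HANDSHAKE-/10.176.249.XX] 2013-10-03 17:36:16,948 (line 399) Handshaking version with",
    " INFO [HANDSHAKE-/10.176.182.YY] 2013-10-03 17:36:17,280 (line 408) Cannot handshake version with",
    " INFO [HANDSHAKE-/10.188.13.ZZ] 2013-10-03 17:36:17,510 (line 408) Cannot handshake version with",
    "TRACE [HANDSHAKE-/10.177.14.XX] 2013-10-03 18:01:50,237 (line 406) Cannot handshake version with",
]

for peer, count in sorted(count_failed_handshakes(sample).items()):
    print(peer, count)
```

Running the same function over a node's full log, for example via `count_failed_handshakes(open(path))` with `path` pointing at that log, would show whether the failures concentrate on the not-yet-upgraded nodes.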

Another observation is that the number of completed compaction tasks
decreased as the number of upgraded nodes increased. I don't know if that's
related to the increased number of read timeouts or just a coincidence. The
timeout configuration is the default (10000ms).

Two similar issues were reported, but without satisfactory responses:


Is this expected behavior, or is something going wrong during the
upgrade? Has anyone faced similar issues?

Any help would be very much appreciated.


