One more piece of information to help troubleshoot the issue:

During the "nodetool drain" operation just before the upgrade, the node actually shuts itself down instead of merely ceasing to accept new writes. This bug was also reported in this other thread: http://mail-archives.apache.org/mod_mbox/cassandra-user/201303.mbox/%3CCAFDWQMTrYm7hBxXKoW8+eVKfNE6zvjW2h8_BSVGmOL7=gRDtLw@mail.gmail.com%3E

Since I started Cassandra 1.2 only a few seconds before Cassandra 1.1 died (after the nodetool drain), I'm afraid there wasn't enough time for the remaining nodes to update their metadata about the "downed" node. When the upgraded node was restarted, the metadata on the other nodes would still have referred to the previous version of that node, which may have caused the handshake problem and, consequently, the read timeout. Does that theory make sense?


2013/10/4 Robert Coli <rcoli@eventbrite.com>
On Fri, Oct 4, 2013 at 9:09 AM, Paulo Motta <pauloricardomg@gmail.com> wrote:
I manually tried to insert data into, and retrieve it from, both the newly upgraded nodes and the old nodes, and the behavior was very unstable: sometimes it worked, sometimes it didn't (TimedOutException), so I don't think it was a network problem.

The number of read timeouts diminished as the number of upgraded nodes increased, until it reached stability. The logs were showing the following messages periodically:

... 
Both of these issues relate to upgrading from 1._0_.x to 1.2.x, which is not supported.

Were I you, I would summarize the above experience in a JIRA ticket, as 1.1.x to 1.2.x should be a supported operation and should not unexpectedly result in decreased availability during the upgrade.

=Rob 



--
Paulo Ricardo

--
European Master in Distributed Computing
Royal Institute of Technology - KTH
Instituto Superior Técnico - IST
http://paulormg.com