Date: Tue, 12 Apr 2011 09:30:56 -0700 (PDT)
Subject: pycassa timeouts resolved by killing a random node in the ring
From: Jason Harvey <alienth@gmail.com>
To: user@cassandra.apache.org

Interesting issue this morning.

My apps started throwing a bunch of pycassa timeouts all of a sudden.
The ring looked perfect. No load issues anywhere, and no errors in the
logs.

The site was basically down, so I got desperate and whacked a random
node in the ring. As soon as gossip saw it go down, the timeouts went
away. Thinking that was kinda crazy, I started the node back up. As
soon as it rejoined the ring, pycassa started timing out again. I then
killed another random node, far away from the first node I killed, and
the timeouts stopped again. Started it back up, and the timeouts
started again when it rejoined the ring.

I repeated this process once more just to make sure I wasn't insane,
and got the same result. Killing any single node, anywhere in the
ring, fixes my timeouts.

I can reproduce this at will. For now I am keeping one node down so
the site doesn't break. Desperate for any suggestions or advice on
this.

Using pycassa 1.0.7. Timeout is set to 15 seconds, with 3 retries.
Reads and writes use consistency level QUORUM. 27 nodes in the ring,
with an RF of 3.
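
For reference, a rough sketch of the quorum arithmetic behind this
setup, assuming the usual Cassandra rule that QUORUM means
floor(RF / 2) + 1 replicas. The function names are made up for
illustration and are not part of pycassa or Cassandra. It only shows
why taking a single node down should not, by itself, cause QUORUM
requests to fail at RF 3.

```python
def quorum(replication_factor):
    """Replicas that must respond for a QUORUM read or write.

    Assumes the standard rule: floor(RF / 2) + 1.
    """
    return replication_factor // 2 + 1


def tolerated_down_replicas(replication_factor):
    """Replicas of one key that can be down while QUORUM still succeeds."""
    return replication_factor - quorum(replication_factor)


RF = 3  # replication factor from the setup described above

print(quorum(RF))                   # 2
print(tolerated_down_replicas(RF))  # 1
```

So with one node down, every key still has at least two live replicas
and QUORUM can be met. It does not explain why the timeouts stop.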

Thanks,
Jason