Return-Path: Delivered-To: apmail-incubator-cassandra-user-archive@minotaur.apache.org Received: (qmail 79779 invoked from network); 17 Dec 2009 16:55:39 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 17 Dec 2009 16:55:39 -0000 Received: (qmail 5203 invoked by uid 500); 17 Dec 2009 16:55:39 -0000 Delivered-To: apmail-incubator-cassandra-user-archive@incubator.apache.org Received: (qmail 5167 invoked by uid 500); 17 Dec 2009 16:55:38 -0000 Mailing-List: contact cassandra-user-help@incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: cassandra-user@incubator.apache.org Delivered-To: mailing list cassandra-user@incubator.apache.org Received: (qmail 5158 invoked by uid 99); 17 Dec 2009 16:55:38 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 17 Dec 2009 16:55:38 +0000 X-ASF-Spam-Status: No, hits=-2.6 required=5.0 tests=BAYES_00 X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of rrabah@playdom.com designates 74.125.149.67 as permitted sender) Received: from [74.125.149.67] (HELO na3sys009aog101.obsmtp.com) (74.125.149.67) by apache.org (qpsmtpd/0.29) with SMTP; Thu, 17 Dec 2009 16:55:36 +0000 Received: from source ([209.85.222.190]) by na3sys009aob101.postini.com ([74.125.148.12]) with SMTP ID DSNKSypicg5BngveW8TEIfrieiPbqND+ZAYm@postini.com; Thu, 17 Dec 2009 08:55:15 PST Received: by pzk28 with SMTP id 28so1679352pzk.27 for ; Thu, 17 Dec 2009 08:55:14 -0800 (PST) MIME-Version: 1.0 Received: by 10.141.13.9 with SMTP id q9mr1898362rvi.185.1261068914196; Thu, 17 Dec 2009 08:55:14 -0800 (PST) In-Reply-To: References: Date: Thu, 17 Dec 2009 08:55:14 -0800 Message-ID: Subject: Re: TimedOutException From: Ramzi Rabah To: cassandra-user@incubator.apache.org Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable I added some debugging code to capture the time a read takes (getColumnFamily) and the time the road trip weakRemoteRead takes. The time it takes to read columns is negligible, so it doesn't seem a problem with getColumnFamily. The time it takes for weakRemoteRead however is > 5 seconds in some cases. So looking at some more debugging output, the log indicates that the packets are in the process of being sent by weakRemoteRead to the correct target node, but for some reason, the target node does not have any reference in the log that it handled the get at all. Couple other things to note: 1- I restarted the nodes one after another, while there was traffic going to them. Don't know if that will throw off cassandra or that the whole thing is a network congestion problem? 2- Read stats on the keyspace level indicate NaN value for Read latency which seems like a bug? Thanks Ramzi On Wed, Dec 16, 2009 at 12:07 PM, Jonathan Ellis wrote: > On Wed, Dec 16, 2009 at 12:46 PM, Ramzi Rabah wrote: >> We are observing increasing number of TimedOutExceptions in cassandra >> 0.5 trunk although the load seems fairly low (about 400 reads/writes >> per second). >> cfstats reports that operations are taking less than 2 ms on average. >> >> 2 Things I have noticed looking at the source code. >> >> 1- TimedOutExceptions are silently swallowed by Cassandra and not >> reported in the logs even at debug level > > It's reported to the client. =A0Hardly "swallowed" :) > >> 2- readstats does not account for these long time running queries that >> time out. > > Right. =A0But the CF-level stats do. > >> I'm wondering, what could be causing the system to go haywire like >> this? > > Hard to say without more information. =A0One shot in the dark is that > get_key_range is a major offender sometimes, as well as workloads that > do lots of deletes + re-inserts for the same keys. > > -Jonathan >