From: Shu Zhang
To: user@cassandra.apache.org
Date: Thu, 10 Nov 2011 11:05:19 -0800
Subject: RE: propertyfilesnitch problem

At first, I was also thinking that one or more nodes in the cluster were broken or not responding. But nodetool cfstats suggests all the nodes are working as expected, and pings give me the expected inter-node latencies. The scores calculated by the dynamic snitch in the steady state also seem to correspond to how we configured the network topology.

We're not timing out, but comparing periods when the dynamic snitch has appropriate scores against periods when it doesn't, the latency of LOCAL_QUORUM operations jumps from ~10ms to ~100ms. QUORUM operations remain at ~100ms regardless of the dynamic snitch settings. We maintain a consistent load throughout the tests, and there are no feedback mechanisms.

Thanks,
Shu
________________________________________
From: scode@scode.org [scode@scode.org] On Behalf Of Peter Schuller [peter.schuller@infidyne.com]
Sent: Wednesday, November 09, 2011 11:07 PM
To: user@cassandra.apache.org
Subject: Re: propertyfilesnitch problem

> 2. With the same setup, after each period as defined by dynamic_snitch_reset_interval_in_ms, the LOCAL_QUORUM performance greatly degrades before drastically improving again within a minute.

This part sounds to me like one or more nodes in the cluster are either broken and not responding at all, or overloaded. Restarts will tend to cause temporary additional pressure on nodes (particularly I/O, due to cache eviction issues).

Because the dynamic snitch won't know that a node is slow (after a reset) until requests actually start timing out, it can take up to rpc_timeout seconds before the node gets snitched away.
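For reference, the knobs involved here all live in cassandra.yaml. The values in the comments below are the usual 1.0-era defaults, quoted from memory rather than from this cluster, so treat both them and the file path as assumptions and check your own config:

    # Quick way to see the snitch/timeout settings on a node. Path and defaults
    # shown in the comments are assumptions; verify against your own install.
    grep -E 'dynamic_snitch_|rpc_timeout_in_ms' /etc/cassandra/cassandra.yaml
    # dynamic_snitch_update_interval_in_ms: 100     <- how often scores are recalculated
    # dynamic_snitch_reset_interval_in_ms: 600000   <- scores wiped every 10 minutes
    # dynamic_snitch_badness_threshold: 0.0         <- how much worse a replica must score before it is avoided
    # rpc_timeout_in_ms: 10000                      <- coordinator-side timeout (10 seconds by default)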
That reset-then-timeout pattern sounds like what you're seeing: on every reset, an rpc_timeout-long period of poor latency for clients. Is your rpc_timeout 60 seconds?

> 4. With dynamic snitch turned on, QUORUM operations' performance is about the same as using LOCAL_QUORUM when the dynamic snitch is off, or during the first minute after a restart with the snitch turned on.

This is strange, unless it is coincidental.

Can you be more specific about the performance characteristics you're seeing when degraded? For example:

* High latency, or timeouts?
* Are you getting Unavailable exceptions?
* Are you maintaining the same overall throughput, or is there a feedback mechanism such that when queries have high latency the request rate decreases?
* Which data points are you using to decide that something is degraded? What matches between the QUORUM case and the LOCAL_QUORUM-without-dynamic-snitch case?

--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
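One way to gather the data points asked for above is to sample per-column-family latencies and thread pool backlogs on each replica across a dynamic_snitch_reset_interval_in_ms boundary, and check whether the ~10ms to ~100ms jump coincides with work piling up on a particular node. A rough sketch, assuming stock nodetool is on the PATH and can reach each node's JMX port; HOST and the 5-second interval are placeholders:

    # Sample latencies and thread-pool state every 5 seconds around a snitch reset.
    # HOST is a placeholder; exact cfstats/tpstats output formatting varies by version.
    while true; do
        date
        nodetool -h HOST cfstats | grep -i latency   # per-keyspace and per-CF read/write latencies
        nodetool -h HOST tpstats                     # look for Pending/Blocked building up on any stage
        sleep 5
    done

Correlating the timestamps in that output with the reset interval should show whether the degradation tracks the reset window described above.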