From: Kirk True
Date: Thu, 04 Oct 2012 11:25:34 -0700
To: dev@cassandra.apache.org
Subject: Expected behavior of number of nodes contacted during CL=QUORUM read

Hi all,

Test scenario:
  4 nodes (.1, .2, .3, .4)
  RF=3
  CL=QUORUM
  Cassandra 1.1.2

I noticed that ReadCallback's constructor determines a 'blockfor' value of 2 for RF=3, CL=QUORUM. According to the API page on the wiki [1], a read at CL=QUORUM:

  Will query *all* replicas and return the record with the most recent
  timestamp once it has at least a majority of replicas (N / 2 + 1)
  reported.

However, ReadCallback's constructor computes blockfor as 2 and then calls filterEndpoints. filterEndpoints is given the list of all three replicas, but at the very end of the method it trims the endpoint list down to only two of them. Those two replicas are then used in StorageProxy to execute the read/digest calls. So the coordinator ends up contacting 2 nodes, not all three as stated on the wiki.
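To make the arithmetic concrete, here is a minimal sketch of my reading of the quorum math and the endpoint trimming. This is not the actual ReadCallback/filterEndpoints source from 1.1.2; the class and method names below are made up for illustration:

    import java.net.InetAddress;
    import java.util.List;

    class QuorumSketch
    {
        // CL=QUORUM blocks for a majority of replicas: RF / 2 + 1.
        static int blockFor(int replicationFactor)
        {
            return replicationFactor / 2 + 1;   // RF=3 -> blockfor=2
        }

        // The trimming I see at the end of filterEndpoints: only 'blockFor'
        // of the replicas are kept, and only those are sent the read/digest
        // requests by StorageProxy.
        static List<InetAddress> trimToBlockFor(List<InetAddress> replicas, int blockFor)
        {
            return replicas.subList(0, Math.min(blockFor, replicas.size()));
        }
    }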
In my test case, I kill a node and then immediately issue a query for a key that has a replica on the downed node. The live nodes in the system don't yet know that the other node is down. Rather than contacting *all* nodes as the wiki states, the coordinator contacts only two -- one of which is the downed node. Since it blocks for two responses and one of the chosen nodes is down, the query times out. Attempting the read again produces the same effect, even when trying different nodes as coordinators. I end up retrying a few times until the failure detectors on the live nodes realize that the node is down.

So the end result is that if a client attempts to read a row that has a replica on a newly downed node, the read will time out repeatedly until the ~30-second failure detector window has passed -- even though there are enough live replicas to satisfy the request. We basically have a scenario wherein a value is not retrievable for upwards of 30 seconds. The percentage of keys that can exhibit this shrinks as the ring grows, but it's still non-zero. (A rough sketch of the coordinator-side wait, as I understand it, is appended below.)

This doesn't seem right, and I'm sure I'm missing something.

Thanks,
Kirk

[1] http://wiki.apache.org/cassandra/API
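P.S. Here is a minimal sketch of the coordinator-side wait as I understand it. The names and the latch-based wait are my own illustration, not the actual StorageProxy/ReadCallback code: with blockfor=2 and one of the two chosen replicas freshly dead, only one response ever arrives, so the wait expires at the rpc timeout.

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    class ReadWaitSketch
    {
        // Block until 'blockfor' replica responses arrive (the latch is
        // created with count = blockfor), or give up at the rpc timeout.
        // With two endpoints chosen and one of them down but not yet
        // convicted, countDown() is only ever called once, so this times out.
        static void awaitResponses(CountDownLatch responses, long rpcTimeoutMillis)
                throws InterruptedException, TimeoutException
        {
            if (!responses.await(rpcTimeoutMillis, TimeUnit.MILLISECONDS))
                throw new TimeoutException("did not receive enough replica responses before the rpc timeout");
        }
    }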