From: Jonathan Ellis
Date: Thu, 4 Oct 2012 14:04:20 -0500
Subject: Re: Expected behavior of number of nodes contacted during CL=QUORUM read
To: dev@cassandra.apache.org

The API page is incorrect. Cassandra only contacts enough nodes to
satisfy the requested CL.

https://issues.apache.org/jira/browse/CASSANDRA-4705 and
https://issues.apache.org/jira/browse/CASSANDRA-2540 are relevant to
the fragility that can result, as you say.

(Although, unless you are doing zero read repairs, I would expect the
dynamic snitch to steer requests away from the unresponsive node a lot
faster than 30s.)

On Thu, Oct 4, 2012 at 1:25 PM, Kirk True wrote:
> Hi all,
>
> Test scenario:
>
> 4 nodes (.1, .2, .3, .4)
> RF=3
> CL=QUORUM
> 1.1.2
>
> I noticed that in ReadCallback's constructor, it determines the 'blockfor'
> number of 2 for RF=3, CL=QUORUM.
>
> According to the API page on the wiki[1] for reads at CL=QUORUM:
>
>    Will query *all* replicas and return the record with the most recent
>    timestamp once it has at least a majority of replicas (N / 2 + 1)
>    reported.
>
> However, in ReadCallback's constructor, it determines blockfor to be 2, then
> calls filterEndpoints. filterEndpoints is given a list of the three
> replicas, but at the very end of the method, the endpoint list is trimmed to
> only two replicas. Those two replicas are then used in StorageProxy to
> execute the read/digest calls. So it ends up as 2 nodes, not all three as
> stated on the wiki.
>
> In my test case, I kill a node and then immediately issue a query for a key
> that has a replica on the downed node.
> The live nodes in the system don't immediately know that the other node is
> down. Rather than contacting *all* nodes as the wiki states, the
> coordinator contacts only two -- one of which is the downed node. Since it
> blocks for two, one of which is down, the query times out. Attempting the
> read again produces the same effect, even when trying different nodes as
> coordinators. I end up retrying a few times until the failure detectors on
> the live nodes realize that the node is down.
>
> So, the end result is that if a client attempts to read a row that has a
> replica on a newly downed node, it will time out repeatedly until the
> ~30-second failure detector window has passed -- even though there are
> enough live replicas to satisfy the request. We basically have a scenario
> wherein a value is not retrievable for upwards of 30 seconds. The percentage
> of keys that exhibit this possibility shrinks as the ring grows, but it's
> still non-zero.
>
> This doesn't seem right and I'm sure I'm missing something.
>
> Thanks,
> Kirk
>
> [1] http://wiki.apache.org/cassandra/API

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
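
For reference, the behavior discussed in this thread can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Cassandra's actual code: the class QuorumReadSketch and the methods quorum, contacted, and readSucceeds are hypothetical names, the replica order is assumed to be fixed (the real order depends on the snitch), and the real filterEndpoints may do more than this (read repair, for example). It only shows why trimming the endpoint list to 'blockfor' replicas makes a read time out while a dead node is still believed to be alive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; none of these names come from the Cassandra code base.
final class QuorumReadSketch {

    // Quorum size for a replication factor: RF / 2 + 1 (integer division).
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // Drop replicas the coordinator believes are down, then keep only the
    // first blockFor of the rest (the trimming described in the thread).
    static List<String> contacted(List<String> replicas, Set<String> believedDown, int blockFor) {
        List<String> live = new ArrayList<>();
        for (String replica : replicas) {
            if (!believedDown.contains(replica)) {
                live.add(replica);
            }
        }
        return new ArrayList<>(live.subList(0, Math.min(blockFor, live.size())));
    }

    // The read succeeds only if at least blockFor contacted replicas respond.
    static boolean readSucceeds(List<String> contacted, Set<String> actuallyDown, int blockFor) {
        int responses = 0;
        for (String replica : contacted) {
            if (!actuallyDown.contains(replica)) {
                responses++;
            }
        }
        return responses >= blockFor;
    }

    public static void main(String[] args) {
        List<String> replicas = List.of(".1", ".2", ".3"); // assumed order
        Set<String> actuallyDown = Set.of(".2");           // node that was just killed
        int blockFor = quorum(3);

        // Before the failure detector notices: the dead node is still contacted.
        List<String> before = contacted(replicas, Set.of(), blockFor);
        System.out.println(before + " succeeds: " + readSucceeds(before, actuallyDown, blockFor));

        // After the failure detector notices: the dead node is skipped.
        List<String> after = contacted(replicas, actuallyDown, blockFor);
        System.out.println(after + " succeeds: " + readSucceeds(after, actuallyDown, blockFor));
    }
}
```

In this sketch the first read contacts .1 and .2, gets one response, and fails; the second contacts .1 and .3 and succeeds. Contacting all three replicas and blocking for two, as the wiki describes, would have succeeded in both cases.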