Mailing-List: contact dev-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@cassandra.apache.org
Received-SPF: pass (athena.apache.org: domain of jbellis@gmail.com designates
 209.85.220.44 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <506DD49E.7000506@mustardgrain.com>
References: <506DD49E.7000506@mustardgrain.com>
From: Jonathan Ellis <jbellis@gmail.com>
Date: Thu, 4 Oct 2012 14:04:20 -0500
Message-ID: 
 <CALdd-zgUYVVfAynktcnMS9Li1jm5xwhRy4Fh_PPx3ftbFSnixw@mail.gmail.com>
Subject: Re: Expected behavior of number of nodes contacted during CL=QUORUM
 read
To: dev@cassandra.apache.org
Content-Type: text/plain; charset=ISO-8859-1

The API page is incorrect.  Cassandra only contacts enough nodes to
satisfy the requested CL.
https://issues.apache.org/jira/browse/CASSANDRA-4705 and
https://issues.apache.org/jira/browse/CASSANDRA-2540 are relevant to
the fragility that can result as you say.  (Although, unless you are
doing zero read repairs I would expect the dynamic snitch to steer
requests away from the unresponsive node a lot faster than 30s.)

On Thu, Oct 4, 2012 at 1:25 PM, Kirk True <kirk@mustardgrain.com> wrote:
> Hi all,
>
> Test scenario:
>
>     4 nodes (.1, .2, .3, .4)
>     RF=3
>     CL=QUORUM
>     1.1.2
>
> I noticed that in ReadCallback's constructor, it determines the 'blockfor'
> number of 2 for RF=3, CL=QUORUM.
>
> According to the API page on the wiki[1] for reads at CL=QUORUM:
>
>    Will query *all* replicas and return the record with the most recent
>    timestamp once it has at least a majority of replicas (N / 2 + 1)
>    reported.
>
>
> However, in ReadCallback's constructor, it determines blockfor to be 2, then
> calls filterEndpoints. filterEndpoints is given a list of the three
> replicas, but at the very end of the method, the endpoint list to only two
> replicas. Those two replicas are then used in StorageProxy to execute the
> read/digest calls. So it ends up as 2 nodes, not all three as stated on the
> wiki.
>
> In my test case, I kill a node and then immediately issue a query for a key
> that has a replica on the downed node. For the live nodes in the system, it
> doesn't immediately know that the other node is down yet. Rather than
> contacting *all* nodes as the wiki states, the coordinator contacts only two
> -- one of which is the downed node. Since it blocks for two, one of which is
> down, the query times out. Attempting the read again produces the same
> effect, even when trying different nodes as coordinators. I end up retrying
> a few times until the failure detectors on the live nodes realize that the
> node is down.
>
> So, the end result is that if a client attempts to read a row that has a
> replica on a newly downed node, it will timeout repeatedly until the ~30
> seconds failure detector window has passed -- even though there are enough
> live replicas to satisfy the request. We basically have a scenario wherein a
> value is not retrievable for upwards of 30 seconds. The percentage of keys
> that exhibit this possibility shrinks as the ring grows, but it's still
> non-zero.
>
> This doesn't seem right and I'm sure I'm missing something.
>
> Thanks,
> Kirk
>
> [1] http://wiki.apache.org/cassandra/API


-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com