Reply-To: user@cassandra.apache.org
MIME-Version: 1.0
In-Reply-To: <BANLkTi=1KyedT+Q1xFmNKxn5XEjwDipm4Q@mail.gmail.com>
References: <BANLkTi=1KyedT+Q1xFmNKxn5XEjwDipm4Q@mail.gmail.com>
From: Jonathan Ellis <jbellis@gmail.com>
Date: Wed, 13 Apr 2011 12:58:50 -0500
Message-ID: <BANLkTi=FNcgaxzBEia-wM_pPz=RT=pLvXg@mail.gmail.com>
Subject: Re: CL.ONE reads and SimpleSnitch unnecessary timeouts
To: user@cassandra.apache.org
Cc: Erik Onnen <eonnen@gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

First, our contract with the client says "we'll give you the answer or
a timeout after rpc_timeout." Once we start trying to cheat on that,
the client no longer has any guarantee about when to expect a
response. So that feels iffy to me.

Second, retrying against a different node isn't expected to give
substantially better results than the client issuing a retry itself,
if that's what it wants: by the time we have timed out once, the
failure detector (FD) and/or dynamic snitch should route the retry to
another node, without adding complexity to StorageProxy. (If that's
not what you see in practice, then we probably have a dynamic snitch
bug.)
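
A minimal sketch of the client-side retry described above. It is not
taken from any Cassandra client library: ReadTimeout, read_with_retry
and the read_once callable are hypothetical names used only for
illustration. Each attempt is still bounded by rpc_timeout on the
server side, so the client decides for itself how long it is willing
to wait in total.

```python
import time


class ReadTimeout(Exception):
    """Hypothetical stand-in for the timeout returned after rpc_timeout."""


def read_with_retry(read_once, attempts=3, backoff_s=0.0):
    """Call read_once() until it succeeds or the attempts run out.

    read_once is an assumed zero-argument callable that performs one
    CL.ONE read and raises ReadTimeout if the coordinator times out.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return read_once()
        except ReadTimeout as exc:
            last_error = exc
            # Optional pause before the next try, skipped after the last one.
            if backoff_s and attempt < attempts - 1:
                time.sleep(backoff_s)
    # Every attempt timed out, so surface the last timeout to the caller.
    raise last_error
```

The retry itself does not choose a node. Whether a later attempt lands
on a healthier replica depends on the failure detector or dynamic
snitch having reacted in the meantime.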

On Wed, Apr 13, 2011 at 12:32 PM, Erik Onnen <eonnen@gmail.com> wrote:
> Sorry for the complex setup, took a while to identify the behavior and
> I'm still not sure I'm reading the code correctly.
>
> Scenario:
>
> Six node ring w/ SimpleSnitch and RF3. For the sake of discussion
> assume the token space looks like:
>
> node-0 1-10
> node-1 11-20
> node-2 21-30
> node-3 31-40
> node-4 41-50
> node-5 51-60
>
> In this scenario we want key 35, where nodes 3, 4 and 5 are natural
> endpoints. Client is connected to node-0, node-1 or node-2. node-3
> goes into a full GC lasting 12 seconds.
>
> What I think we're seeing is that as long as we read with CL.ONE *and*
> are connected to 0, 1 or 2, we'll never get a response for the
> requested key until the failure detector kicks in and convicts 3,
> resulting in reads spilling over to the other endpoints.
>
> We've tested this by switching to CL.QUORUM and haven't seen read
> timeouts during big GCs since.
>
> Assuming the above, is this behavior really correct? We have copies of
> the data on two other nodes, but because this snitch config always
> picks node-3, we always time out until conviction, which can sometimes
> take up to 8 seconds. Shouldn't the read attempt to pick a different
> endpoint after the first timeout rather than repeatedly trying a node
> that isn't responding?
>
> Thanks,
> -erik
>
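
The quoted scenario can be modelled in a few lines. This is a toy
sketch, not Cassandra code: the ring layout comes from the message
above, replicas are taken as the owning node plus the next two nodes
clockwise, and pick_for_cl_one encodes the assumption (Erik's reading
of the code) that a static snitch always hands a CL.ONE read to the
first natural endpoint the failure detector has not yet convicted.

```python
from bisect import bisect_left

# Upper bound of each node's token range, as listed in the scenario.
RING = [(10, "node-0"), (20, "node-1"), (30, "node-2"),
        (40, "node-3"), (50, "node-4"), (60, "node-5")]


def natural_endpoints(key_token, ring=RING, rf=3):
    """Return the owner of key_token plus the next rf-1 nodes clockwise."""
    tokens = [token for token, _ in ring]
    # First node whose range end is >= the key; wrap past the last token.
    start = bisect_left(tokens, key_token) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(rf)]


def pick_for_cl_one(endpoints, alive):
    """Assumed static choice: the first endpoint not marked down."""
    for node in endpoints:
        if node in alive:
            return node
    return None


endpoints = natural_endpoints(35)
all_nodes = {name for _, name in RING}

# While node-3 is paused in GC but not yet convicted, it is still chosen.
before_conviction = pick_for_cl_one(endpoints, all_nodes)

# Once the failure detector convicts node-3, reads spill over to node-4.
after_conviction = pick_for_cl_one(endpoints, all_nodes - {"node-3"})
```

Under these assumptions every CL.ONE read for key 35 goes to node-3
until conviction, which matches the timeouts reported in the quoted
message.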


-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com