From: Jonathan Ellis
Date: Wed, 13 Apr 2011 12:58:50 -0500
Subject: Re: CL.ONE reads and SimpleSnitch unnecessary timeouts
To: user@cassandra.apache.org
Cc: Erik Onnen

First, our contract with the client says "we'll give you the answer or
a timeout after rpc_timeout." Once we start trying to cheat on that,
the client no longer has any guarantee of when it should expect a
response. So that feels iffy to me.

Second, retrying to a different node isn't expected to give
substantially better results than the client issuing a retry itself,
if that's what it wants. By the time we time out once, the FD and/or
dynamic snitch should route the retry to another node, without adding
additional complexity to StorageProxy. (If that's not what you see in
practice, then we probably have a dynamic snitch bug.)

On Wed, Apr 13, 2011 at 12:32 PM, Erik Onnen wrote:
> Sorry for the complex setup, took a while to identify the behavior and
> I'm still not sure I'm reading the code correctly.
>
> Scenario:
>
> Six node ring w/ SimpleSnitch and RF3. For the sake of discussion
> assume the token space looks like:
>
> node-0 1-10
> node-1 11-20
> node-2 21-30
> node-3 31-40
> node-4 41-50
> node-5 51-60
>
> In this scenario we want key 35 where nodes 3,4 and 5 are natural
> endpoints. Client is connected to node-0, node-1 or node-2. node-3
> goes into a full GC lasting 12 seconds.
>
> What I think we're seeing is that as long as we read with CL.ONE *and*
> are connected to 0,1 or 2, we'll never get a response for the
> requested key until the failure detector kicks in and convicts 3
> resulting in reads spilling over to the other endpoints.
>
> We've tested this by switching to CL.QUORUM and since haven't seen
> read timeouts during big GCs.
>
> Assuming the above, is this behavior really correct? We have copies of
> the data on two other nodes but because this snitch config always
> picks node-3, we always timeout until conviction which can take up to
> 8 seconds sometimes. Shouldn't the read attempt to pick a different
> endpoint in the case of the first timeout rather than repeatedly
> trying a node that isn't responding?
>
> Thanks,
> -erik
>

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
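
For readers following the thread, the scenario in the quoted message can be sketched as a small toy model. This is only an illustration of the behavior Erik describes, not Cassandra's actual read path or StorageProxy code. The names `RING`, `natural_endpoints`, and `read_succeeds` are hypothetical, and the model assumes that without any snitch reordering a CL.ONE read contacts only the first natural endpoint, while a CL.QUORUM read contacts all replicas and needs a majority of them to answer.

```python
from bisect import bisect_left

# Upper token bound of each node's range, copied from the quoted scenario.
RING = [(10, "node-0"), (20, "node-1"), (30, "node-2"),
        (40, "node-3"), (50, "node-4"), (60, "node-5")]


def natural_endpoints(token, rf=3):
    """Owner of `token` plus the next rf-1 nodes clockwise (toy placement)."""
    bounds = [upper for upper, _ in RING]
    start = bisect_left(bounds, token) % len(RING)
    return [RING[(start + i) % len(RING)][1] for i in range(rf)]


def read_succeeds(token, consistency, unresponsive, rf=3):
    """Toy model of the reported behavior, not Cassandra's real read path.

    Assumption: with no snitch reordering, CL.ONE contacts only the first
    natural endpoint; CL.QUORUM contacts every replica and needs a
    majority of them to respond.
    """
    endpoints = natural_endpoints(token, rf)
    if consistency == "ONE":
        contacted, required = endpoints[:1], 1
    elif consistency == "QUORUM":
        contacted, required = endpoints, rf // 2 + 1
    else:
        raise ValueError(f"unsupported consistency level: {consistency}")
    responses = [node for node in contacted if node not in unresponsive]
    return len(responses) >= required


if __name__ == "__main__":
    paused = {"node-3"}  # node-3 is stuck in a long GC pause
    print(natural_endpoints(35))
    print(read_succeeds(35, "ONE", paused))
    print(read_succeeds(35, "QUORUM", paused))
```

Under these assumptions, key 35 maps to node-3, node-4, and node-5. The CL.ONE read fails while node-3 is paused, and the CL.QUORUM read succeeds because node-4 and node-5 still form a majority. That matches what Erik reports seeing.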