Subject: Re: get_key_range (CASSANDRA-169)
From: Simon Smith
To: cassandra-user@incubator.apache.org
Date: Wed, 9 Sep 2009 18:43:25 -0400

The error starts as soon as node #5 goes down and lasts until I restart it. bin/nodeprobe cluster is accurate (it knows quickly when #5 is down, and when it is up again). Since I set the replication factor to 3, I'm confused as to why (after the first few seconds or so) there is still an error just because one host is down temporarily.

The way I have the test set up, a script runs on each of the nodes, calling get_key_range over and over against "localhost". Depending on which node I take down, the behavior varies: if I take down one particular host, it is the only one giving errors (the other 4 nodes still work). In the other 4 cases, either 2 or 3 nodes continue to work (i.e. the downed node and either one or two other nodes are the ones giving errors). Note: the nodes that keep working never fail at all, not even for a few seconds.

I am running this on 4GB "cloud server" boxes at Rackspace. I can set up just about any test needed to help debug this and capture output or logs, and can give a Cassandra developer access if it would help. Of course I can include whatever config files or log files would be helpful; I just don't want to spam the list unless it is relevant.
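For reference, the polling script on each node does essentially the following (a minimal Python sketch; `fetch_key_range` stands in for the actual Thrift get_key_range call, and the retry parameters are illustrative, not taken from my real script):

```python
import time

def call_with_retry(fn, retries=3, delay=1.0):
    """Call fn(), retrying up to `retries` extra times on any exception.

    This mirrors the "catch the error and try again" idea: a transient
    timeout right after a node dies would be absorbed here, but the
    persistent get_key_range failures described above exhaust every
    retry and re-raise the last exception.
    """
    last_exc = None
    for _attempt in range(retries + 1):
        try:
            return fn()
        except Exception as exc:  # real script: Thrift timeout/transport errors
            last_exc = exc
            time.sleep(delay)
    raise last_exc

def poll_loop(fetch_key_range, interval=1.0):
    """Hammer localhost with get_key_range over and over, as in the test."""
    while True:
        keys = call_with_retry(fetch_key_range)
        print("got %d keys" % len(keys))
        time.sleep(interval)
```

In the actual test, `fetch_key_range` wraps the Thrift get_key_range call against localhost; it is left abstract here.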
Thanks again,

Simon

On Tue, Sep 8, 2009 at 6:26 PM, Jonathan Ellis wrote:
> getting temporary errors when a node goes down, until the other nodes'
> failure detectors realize it's down, is normal. (this should only
> take a dozen seconds, or so.)
>
> but after that it should route requests to other nodes, and it should
> also realize when you restart #5 that it is alive again. those are
> two separate issues.
>
> can you verify that "bin/nodeprobe cluster" shows that node 1
> eventually does/does not see #5 dead, and alive again?
>
> -Jonathan
>
> On Tue, Sep 8, 2009 at 5:05 PM, Simon Smith wrote:
>> I'm seeing an issue similar to:
>>
>> http://issues.apache.org/jira/browse/CASSANDRA-169
>>
>> Here is when I see it. I'm running Cassandra on 5 nodes using the
>> OrderPreservingPartitioner, and have populated Cassandra with 78
>> records, and I can use get_key_range via Thrift just fine. Then, if I
>> manually kill one of the nodes (if I kill off node #5), the node (node
>> #1) which I've been using to call get_key_range will time out with the
>> error:
>>
>>   Thrift: Internal error processing get_key_range
>>
>> And the Cassandra output shows the same trace as in 169:
>>
>> ERROR - Encountered IOException on connection:
>> java.nio.channels.SocketChannel[closed]
>> java.net.ConnectException: Connection refused
>>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
>>         at org.apache.cassandra.net.TcpConnection.connect(TcpConnection.java:349)
>>         at org.apache.cassandra.net.SelectorManager.doProcess(SelectorManager.java:131)
>>         at org.apache.cassandra.net.SelectorManager.run(SelectorManager.java:98)
>> WARN - Closing down connection java.nio.channels.SocketChannel[closed]
>> ERROR - Internal error processing get_key_range
>> java.lang.RuntimeException: java.util.concurrent.TimeoutException:
>> Operation timed out.
>>         at org.apache.cassandra.service.StorageProxy.getKeyRange(StorageProxy.java:573)
>>         at org.apache.cassandra.service.CassandraServer.get_key_range(CassandraServer.java:595)
>>         at org.apache.cassandra.service.Cassandra$Processor$get_key_range.process(Cassandra.java:853)
>>         at org.apache.cassandra.service.Cassandra$Processor.process(Cassandra.java:606)
>>         at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>         at java.lang.Thread.run(Thread.java:675)
>> Caused by: java.util.concurrent.TimeoutException: Operation timed out.
>>         at org.apache.cassandra.net.AsyncResult.get(AsyncResult.java:97)
>>         at org.apache.cassandra.service.StorageProxy.getKeyRange(StorageProxy.java:569)
>>         ... 7 more
>>
>> If it was giving an error just one time, I could just rely on catching
>> the error and trying again. But a get_key_range call to the node I
>> was already making get_key_range queries against (node #1) never works
>> again (it is still up and it responds fine to multiget Thrift calls),
>> sometimes not even after I restart the down node (node #5). I end up
>> having to restart node #1 in addition to node #5. The behavior of the
>> other 3 nodes varies: some of them are also unable to respond to
>> get_key_range calls, but some of them do respond to get_key_range
>> calls.
>>
>> My question is, what path should I go down in terms of reproducing
>> this problem? I'm using Aug 27 trunk code - should I update my
>> Cassandra install prior to gathering more information for this issue,
>> and if so, which version (0.4 or trunk)? If there is anyone who is
>> familiar with this issue, could you let me know what I might be doing
>> wrong, or what the next info-gathering step should be for me?
>>
>> Thank you,
>>
>> Simon Smith
>> Arcode Corporation