To: user@cassandra.apache.org
From: Aaron Morton
Subject: Re: Marking each node down before rolling restart
Date: Thu, 30 Sep 2010 02:40:20 GMT

I just ran nodetool drain in a 3-node cluster that was not serving any requests; the other nodes picked up the change in about 10 seconds.

On the node I drained:
 INFO [RMI TCP Connection(39)-192.168.34.31] 2010-09-30 15:18:03,281 StorageService.java (line 474) Starting drain process
 INFO [RMI TCP Connection(39)-192.168.34.31] 2010-09-30 15:18:03,282 MessagingService.java (line 348) Shutting down MessageService...
 INFO [ACCEPT-sorb/192.168.34.31] 2010-09-30 15:18:03,289 MessagingService.java (line 529) MessagingService shutting down server thread.
 INFO [RMI TCP Connection(39)-192.168.34.31] 2010-09-30 15:18:03,290 MessagingService.java (line 365) Shutdown complete (no further commands will be processed)
 INFO [RMI TCP Connection(39)-192.168.34.31] 2010-09-30 15:18:03,339 StorageService.java (line 474) Node is drained

On one of the others:
 INFO [Timer-0] 2010-09-30 15:18:12,753 Gossiper.java (line 196) InetAddress /192.168.34.31 is now dead.
DEBUG [Timer-0] 2010-09-30 15:18:12,753 MessagingService.java (line 134) Resetting pool for /192.168.34.31
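
To put a number on "about 10 seconds", here is a minimal Python sketch that subtracts the two log timestamps above. It assumes the clocks on both nodes are in sync, which the logs alone do not prove; the function and variable names are mine, not anything from Cassandra.

```python
from datetime import datetime

# log4j-style timestamp: a comma separates seconds from milliseconds.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def seconds_between(start: str, end: str) -> float:
    """Return the elapsed seconds between two log4j-style timestamps."""
    t0 = datetime.strptime(start, LOG_TIME_FORMAT)
    t1 = datetime.strptime(end, LOG_TIME_FORMAT)
    return (t1 - t0).total_seconds()


# "Starting drain process" on the drained node.
drain_started = "2010-09-30 15:18:03,281"
# "is now dead" on one of the other nodes.
marked_dead = "2010-09-30 15:18:12,753"

delay = seconds_between(drain_started, marked_dead)
print(f"{delay:.3f} s")  # 9.472 s
```

So in this one idle-cluster run the other node noticed roughly 9.5 seconds after the drain started.
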
Either way, I would say it's safer to drain the node first. Draining writes out the SSTables and drains the commit log, so after the restart the server will not need to replay the log. This may be a good thing in the event of an issue with the upgrade.

My guess is:
- drain the node
- other nodes can still read from it; it will actively reject writes (because the Messaging Service is down). So no timeouts.
- wait until the down state of the node is propagated around the cluster, then shut it down.
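
The steps above could be scripted roughly as follows. This is only a sketch of the guessed procedure, not a tested one: the three callables are hypothetical hooks you would have to supply yourself, and the polling interval and maximum wait are arbitrary assumptions.

```python
import time
from typing import Callable


def drain_then_stop(
    drain: Callable[[], None],
    peers_see_node_down: Callable[[], bool],
    stop: Callable[[], None],
    poll_interval_s: float = 1.0,
    max_wait_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Drain a node, wait until peers report it down, then stop it.

    Returns True if the peers saw the node as down before it was stopped,
    False if max_wait_s elapsed first (the node is stopped either way).
    """
    drain()
    waited = 0.0
    seen_down = peers_see_node_down()
    while not seen_down and waited < max_wait_s:
        sleep(poll_interval_s)
        waited += poll_interval_s
        seen_down = peers_see_node_down()
    stop()
    return seen_down


# Dry run with stand-ins instead of a real cluster: the "peers" report
# the node as down on the third check.
answers = iter([False, False, True])
result = drain_then_stop(
    drain=lambda: print("drain"),
    peers_see_node_down=lambda: next(answers),
    stop=lambda: print("stop"),
    sleep=lambda seconds: None,  # skip real sleeping in the dry run
)
print(result)
```

In a real run, drain would wrap a call to nodetool drain on the target node, and peers_see_node_down would have to ask another node for its view (for example by reading the status column of nodetool ring). How exactly to wire those up depends on your version and setup, so I have left them out.
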
 
I may be able to test out the theory under a light load later today or tomorrow. Anyone else have any thoughts?

Aaron


On 30 Sep, 2010, at 02:54 PM, Justin Sanders <justin@bronto.com> wrote:

It seems to take about 15 seconds after killing a node before the other nodes report it as down.

We are running a 9-node cluster with RF=3, all reads and writes at quorum. I was making the same assumption you are: that an operation would complete fine at quorum with only one node down, since the other two nodes would be able to respond.
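
The quorum arithmetic behind that assumption, as a small Python sketch. The function names are made up for illustration. Only RF enters the calculation, not the 9-node cluster size, because each row lives on RF replicas.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond for a QUORUM operation."""
    return replication_factor // 2 + 1


def replicas_that_can_be_down(replication_factor: int) -> int:
    """Replicas of a row that can be unavailable while QUORUM still succeeds."""
    return replication_factor - quorum(replication_factor)


rf = 3
print(quorum(rf))                     # 2
print(replicas_that_can_be_down(rf))  # 1
```

This only shows that enough replicas stay alive with one node down; it says nothing about what happens in the window before the dead node is marked down, which is where the timeouts show up.
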

Justin


On Wed, Sep 29, 2010 at 5:58 PM, Aaron Morton <aaron@thelastpickle.com> wrote:

Ah, that was not exactly what you were after. I do not know how long it takes gossip / the failure detector to detect a down node.

In your case, what CL are you using for reads, and what is your RF? The hope would be that taking one node down at a time would leave enough servers running to serve the request. AFAIK the coordinator will make a read request to the first node responsible for the row, and only ask for a digest from the others. So there may be a case where it has to time out reading from the first node before asking for the full data from the others.

A hack solution may be to reduce rpc_timeout_in_ms.

May need some adult supervision to answer this one.

Aaron


On 30 Sep, 2010, at 10:45 AM, Aaron Morton <aaron@thelastpickle.com> wrote:

Try nodetool drain:

Flushes all memtables for a node and causes the node to stop accepting write operations. Read operations will continue to work. This is typically used before upgrading a node to a new version of Cassandra.
http://www.riptano.com/docs/0.6.5/utils/nodetool

Aaron


On 30 Sep, 2010, at 10:15 AM, Justin Sanders <justin@justinjas.com> wrote:

I looked through the documentation but couldn't find anything. I was wondering if there is a way to manually mark a node "down" in the cluster instead of killing the cassandra process and letting the other nodes figure out the node is no longer up.

The reason I ask is that we are having an issue when we perform rolling restarts on the cluster. Basically, read requests that come in on other nodes will block while they are waiting on the node that was just killed to be marked down. Before they realize the node is offline, they will throw a TimedOutException.

If I could mark the node as down ahead of time, this timeout period could be avoided. Any help is appreciated.

Justin