Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm
Reply-To: dev@cassandra.apache.org
Date: Wed, 15 Jun 2016 02:59:53 +0000 (UTC)
From: "Paulo Motta (JIRA)"
To: commits@cassandra.apache.org
Subject: [jira] [Commented] (CASSANDRA-12008) Allow retrying failed streams (or stop them from failing)
archived-at: Wed, 15 Jun 2016 02:59:55 -0000

    [ https://issues.apache.org/jira/browse/CASSANDRA-12008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331030#comment-15331030 ]

Paulo Motta commented on CASSANDRA-12008:
-----------------------------------------

In this specific case it seems the streaming failed due to a low {{streaming_socket_timeout}} value. We just found out that our previous default of 1 hour was too low, and raised it to 24 hours in CASSANDRA-11840, for 3.0.7. Could you try increasing that and see if it helps with failed decommissions?

> Allow retrying failed streams (or stop them from failing)
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-12008
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12008
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Lifecycle
>            Reporter: Tom van der Woerdt
>
> We're dealing with large data sets (multiple terabytes per node) and sometimes we need to add or remove nodes. These operations are very dependent on the entire cluster being up, so while we're joining a new node (which sometimes takes 6 hours or longer) a lot can go wrong, and in a lot of cases something does.
> It would be great if the ability to retry streams was implemented.
> Example to illustrate the problem:
> {code}
> 03:18 PM   ~ $ nodetool decommission
> error: Stream failed
> -- StackTrace --
> org.apache.cassandra.streaming.StreamException: Stream failed
>         at org.apache.cassandra.streaming.management.StreamEventJMXNotifier.onFailure(StreamEventJMXNotifier.java:85)
>         at com.google.common.util.concurrent.Futures$6.run(Futures.java:1310)
>         at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:457)
>         at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)
>         at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145)
>         at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202)
>         at org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:210)
>         at org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:186)
>         at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:430)
>         at org.apache.cassandra.streaming.StreamSession.complete(StreamSession.java:622)
>         at org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:486)
>         at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:274)
>         at java.lang.Thread.run(Thread.java:745)
> 08:04 PM   ~ $ nodetool decommission
> nodetool: Unsupported operation: Node in LEAVING state; wait for status to become normal or restart
> See 'nodetool help' or 'nodetool help '.
> {code}
> Streaming failed, probably due to load:
> {code}
> ERROR [STREAM-IN-/] 2016-06-14 18:05:47,275 StreamSession.java:520 - [Stream #] Streaming error occurred
> java.net.SocketTimeoutException: null
>         at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:211) ~[na:1.8.0_77]
>         at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) ~[na:1.8.0_77]
>         at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385) ~[na:1.8.0_77]
>         at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:54) ~[apache-cassandra-3.0.6.jar:3.0.6]
>         at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:268) ~[apache-cassandra-3.0.6.jar:3.0.6]
>         at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]
> {code}
> If implementing retries is not possible, can we have a 'nodetool decommission resume'?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
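
For anyone trying the suggested timeout increase, the snippet below is a rough sketch of the arithmetic only. It assumes the setting is spelled {{streaming_socket_timeout_in_ms}} in cassandra.yaml and takes a value in milliseconds; the comment above only names {{streaming_socket_timeout}}, so check the key against the cassandra.yaml shipped with your version. The helper names hours_to_ms and yaml_line are hypothetical and exist only for illustration.

```python
from datetime import timedelta


def hours_to_ms(hours: float) -> int:
    """Convert a duration in hours to whole milliseconds."""
    return timedelta(hours=hours) // timedelta(milliseconds=1)


def yaml_line(timeout_ms: int) -> str:
    """Render a candidate cassandra.yaml line.

    Assumption: the key is `streaming_socket_timeout_in_ms`; the comment
    in this thread only refers to `streaming_socket_timeout`.
    """
    return f"streaming_socket_timeout_in_ms: {timeout_ms}"


# Defaults as described in the comment: 1 hour before, 24 hours after.
old_default_ms = hours_to_ms(1)
new_default_ms = hours_to_ms(24)

if __name__ == "__main__":
    print(yaml_line(old_default_ms))
    print(yaml_line(new_default_ms))
```

The printed line is only a candidate to paste into the configuration of each node taking part in the stream; whether a longer timeout actually prevents the failed decommission is exactly what the comment asks the reporter to test.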