From: "Benjamin Roth (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Date: Mon, 15 Aug 2016 08:46:20 +0000 (UTC)
Subject: [jira] [Comment Edited] (CASSANDRA-12280) nodetool repair hangs

[ https://issues.apache.org/jira/browse/CASSANDRA-12280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420727#comment-15420727 ]

Benjamin Roth edited comment on CASSANDRA-12280 at 8/15/16 8:46 AM:
--------------------------------------------------------------------

First, thanks for your hints. I guess there are multiple issues related to that.

The first problem I spotted was that some nodes were "GC'ed to death" under load, e.g. on a rebuild, bootstrap or a big repair. I could fix that by changing some settings like memtable_flush_writers and concurrent_compactors, and by increasing the heap a bit (xmx, xms) and the young generation (xmn). So this is actually not a bug, but after having read a lot of resources it seems the default settings do not unconditionally work for larger setups. I read several blogs where guys mentioned that e.g. the 100MB/core xmn setting is "wrong" / "outdated".

When something hangs, it is always a stream that hangs. It is probably waiting for "something", but I have to get more details first. I don't assume it is due to network overload.

Maybe off-topic, but maybe also related:
The behaviour can be observed best on rebuild or bootstrap, where there are a few large streaming plans. The throughput is very high in the beginning (up to 200 Mbit/s inbound on the bootstrapping node) and degrades after a short time (I guess when memtables start flushing); then it continues at 60-100 Mbit/s until one or more streams start to stall. Then throughput and load go completely down. Unfortunately I could not find an obvious reason for it: no overloading, no CPU load, no disk load, no network load. Just an idle bootstrapping node and the existing nodes doing their jobs as always, also without overload.
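For reference, this is roughly how I inspect the streams and their threads while they stall; jstack is just the command-line alternative to jconsole here, the PID is a placeholder, and the thread name prefix may differ between versions:

    # list the streaming sessions and their progress on the affected node
    nodetool netstats

    # thread dump of the Cassandra process (<cassandra-pid> is a placeholder);
    # the streaming threads usually show up with a STREAM-IN-/STREAM-OUT- prefix
    jstack <cassandra-pid> | grep -A 20 'STREAM-IN'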
nodetool netstats reveals that there are stalled streams just lingering around. When connecting via jconsole, the thread mostly shows up as WAITING, seemingly blocked by an ArrayBlockingQueue. Sometimes the streams catch up after an arbitrary amount of time (a minute, a few minutes, 30 minutes, an hour); sometimes they just time out. In case of bootstrapping this induces another ugly behaviour: when the node continues to boot (and join) after a failed bootstrap (e.g. due to a stream timeout), the node is marked as "UP" and clients start to query that node, but then the client throws exceptions: "Cannot read from a bootstrapping node". Maybe this is worth a separate ticket?

What I observed this morning, which also maybe relates:
The system.batches table(s) grew over and over. One node had a batchlog of over 70 GB, with no sign that something was in progress, that the logs were shrinking, or that anything was wrong (no down nodes, no "bad" logs). I haven't used any batches from within our application for weeks, so I dared to truncate the logs with the CQL TRUNCATE command. Maybe an hour later I saw that repairs hung again. Then I recognized that compactions on system.batches were going on on each node in the cluster, and they were also hanging. There were no (debug) logs about that. See the output of compactionstats: https://gist.github.com/brstgt/6277764f6e34b0531b9bfc5392491280. After having restarted all nodes, compaction of the batches worked again. Is it possible that repair uses batches internally, so that this blocked a repair?
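For reference, the batchlog cleanup and the follow-up check boil down to roughly this (illustrative only; the node address is a placeholder and I ran it per node):

    # truncate the batchlog via CQL, as described above
    cqlsh <node-address> -e "TRUNCATE system.batches;"

    # then watch the compactions that are triggered on system.batches
    nodetool compactionstats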
If you have any more hints for me, or need more information that I can provide, I am happy to do so :)

Unfortunately I am quite new to C* and obviously I am dropping every brick there is to drop, but I am willing to learn (I am currently eating blogs + books), to help, and to get my f***** cluster up and running stable :D

Thanks so far!


> nodetool repair hangs
> ---------------------
>
>          Key: CASSANDRA-12280
>          URL: https://issues.apache.org/jira/browse/CASSANDRA-12280
>      Project: Cassandra
>   Issue Type: Bug
>     Reporter: Benjamin Roth
>
> nodetool repair hangs when repairing a keyspace; it does not hang when repairing table/MV by table/MV.
> Command executed (both variants make it hang):
> nodetool repair likes like dislike_by_source_mv like_by_contact_mv match_valid_mv like_out dislike match match_by_contact_mv like_valid_mv like_out_by_source_mv
> OR
> nodetool repair likes
> Logs:
> https://gist.github.com/brstgt/bf8b20fa1942d29ab60926ede7340b75
> Nodetool output:
> https://gist.github.com/brstgt/3aa73662da4b0190630ac1aad6c90a6f
> Schema:
> https://gist.github.com/brstgt/3fd59e0166f86f8065085532e3638097


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)