From: "Benjamin Roth (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Date: Mon, 15 Aug 2016 08:46:20 +0000 (UTC)
Subject: [jira] [Comment Edited] (CASSANDRA-12280) nodetool repair hangs

[ https://issues.apache.org/jira/browse/CASSANDRA-12280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420727#comment-15420727 ]

Benjamin Roth edited comment on CASSANDRA-12280 at 8/15/16 8:46 AM:
--------------------------------------------------------------------

First, thanks for your hints. I guess there are multiple issues related to that.

The first problem I spotted was that some nodes were "GC'ed to death" under load, e.g. on a rebuild, bootstrap or a big repair. I could fix that by changing some settings like memtable_flush_writers and concurrent_compactors, and by increasing the heap a bit (xmx, xms) and the young generation (xmn). So this is actually not a bug, but after having read a lot of resources it seems the default settings do not unconditionally work for larger setups. I read several blogs where guys mentioned that e.g. the 100MB/core xmn setting is "wrong" / "outdated".

When something hangs, it is always a stream that hangs. It is probably waiting for "something", but I have to get more details first. I don't assume it is due to network overload.

Maybe off-topic, but maybe also related:
The behaviour can be observed best on rebuild or bootstrap, where there are a few large streaming plans. The throughput is very high in the beginning (up to 200 Mbit/s inbound on the bootstrapping node) and degrades after a short time (I guess when memtables start flushing); then it continues at 60-100 Mbit/s until one or more streams start to stall. Then throughput and load go completely down. Unfortunately I could not find an obvious reason for it: no overloading, no CPU load, no disk load, no network load. Just an idle bootstrapping node and the existing nodes doing their jobs as always, also without overload.
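For reference, this is roughly how I inspect the streams and their threads while they stall; jstack is just the command-line alternative to jconsole here, the PID is a placeholder, and the thread name prefix may differ between versions:

    # list the streaming sessions and their progress on the affected node
    nodetool netstats

    # thread dump of the Cassandra process (<cassandra-pid> is a placeholder);
    # the streaming threads usually show up with a STREAM-IN-/STREAM-OUT- prefix
    jstack <cassandra-pid> | grep -A 20 'STREAM-IN'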
nodetool netstats reveals that there are stalled streams just lingering around. When connecting via jconsole, the thread mostly shows up as WAITING, seemingly blocked by an ArrayBlockingQueue. Sometimes the streams catch up after an arbitrary amount of time (a minute, a few minutes, 30 minutes, an hour); sometimes they just time out. In case of bootstrapping this induces another ugly behaviour: when the node continues to boot (and join) after a failed bootstrap (e.g. due to a stream timeout), the node is marked as "UP" and clients start to query that node, but then the client throws exceptions: "Cannot read from a bootstrapping node". Maybe this is worth a separate ticket?

What I observed this morning, which also maybe relates:
The system.batches table(s) grew over and over. One node had a batchlog of over 70 GB, with no sign that something was in progress, that the logs were shrinking, or that anything was wrong (no down nodes, no "bad" logs). I haven't used any batches from within our application for weeks, so I dared to truncate the logs with the CQL TRUNCATE command. Maybe an hour later I saw that repairs hung again. Then I recognized that compactions on system.batches were going on on each node in the cluster, and they were also hanging. There were no (debug) logs about that. See the output of compactionstats: https://gist.github.com/brstgt/6277764f6e34b0531b9bfc5392491280. After having restarted all nodes, compaction of the batches worked again. Is it possible that repair uses batches internally, so that this blocked a repair?
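For reference, the batchlog cleanup and the follow-up check boil down to roughly this (illustrative only; the node address is a placeholder and I ran it per node):

    # truncate the batchlog via CQL, as described above
    cqlsh <node-address> -e "TRUNCATE system.batches;"

    # then watch the compactions that are triggered on system.batches
    nodetool compactionstats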
If you have any more hints for me, or need more information that I can provide, I am happy to do so :)

Unfortunately I am quite new to C* and obviously I am dropping every brick there is to drop, but I am willing to learn (I am currently eating blogs + books), to help, and to get my f***** cluster up and running stable :D

Thanks so far!


> nodetool repair hangs
> ---------------------
>
>          Key: CASSANDRA-12280
>          URL: https://issues.apache.org/jira/browse/CASSANDRA-12280
>      Project: Cassandra
>   Issue Type: Bug
>     Reporter: Benjamin Roth
>
> nodetool repair hangs when repairing a keyspace; it does not hang when repairing table/MV by table/MV.
> Command executed (both variants make it hang):
> nodetool repair likes like dislike_by_source_mv like_by_contact_mv match_valid_mv like_out dislike match match_by_contact_mv like_valid_mv like_out_by_source_mv
> OR
> nodetool repair likes
> Logs:
> https://gist.github.com/brstgt/bf8b20fa1942d29ab60926ede7340b75
> Nodetool output:
> https://gist.github.com/brstgt/3aa73662da4b0190630ac1aad6c90a6f
> Schema:
> https://gist.github.com/brstgt/3fd59e0166f86f8065085532e3638097


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)