cassandra-commits mailing list archives

From "Benjamin Roth (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-12280) nodetool repair hangs
Date Mon, 15 Aug 2016 08:46:20 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-12280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420727#comment-15420727 ]

Benjamin Roth edited comment on CASSANDRA-12280 at 8/15/16 8:46 AM:
--------------------------------------------------------------------

First of all, thanks for your hints. I guess there are multiple issues related to this.
The first problem I spotted was that some nodes were "GC'ed to death" under load, e.g. on
a rebuild, bootstrap or a big repair.
I could fix that by changing some settings like memtable_flush_writers and concurrent_compactors
and by increasing the heap a bit (Xmx, Xms) as well as the young generation (Xmn). So this is
actually not a bug, but after having read a lot of resources it seems the default settings do
not unconditionally work for larger setups. I read several blogs where people mentioned that
e.g. the 100 MB/core Xmn setting is "wrong" / "outdated".
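
To illustrate the kind of knobs I mean, roughly along these lines (the values below are only
an example for this discussion, not a recommendation and not necessarily exactly what we run):

    # cassandra.yaml (example values)
    memtable_flush_writers: 4
    concurrent_compactors: 4

    # cassandra-env.sh (example values)
    MAX_HEAP_SIZE="16G"    # -Xmx / -Xms
    HEAP_NEWSIZE="4G"      # -Xmn, instead of the default of 100MB per core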

When something hangs, it is always a stream that hangs. I guess it is probably waiting for
"something", but I have to get more details first. I don't think it is due to network overload.


Maybe off-topic, but possibly related:
The behaviour can be observed best on a rebuild or bootstrap, where there are a few large streaming
plans. The throughput is very high in the beginning (up to 200 Mbit/s inbound on the bootstrapping
node) and degrades after a short time (I guess when memtables start flushing); it then continues
at 60-100 Mbit/s until one or more streams start to stall. Then throughput and load go down
completely. Unfortunately I could not find an obvious reason for it: no overload, no CPU load,
no disk load, no network load. Just an idle bootstrapping node, with the existing nodes doing
their jobs as always, also without overload. nodetool netstats reveals that there are stalled
streams just lingering around. When connecting via jconsole, the thread mostly shows up as
WAITING, seemingly blocked on an ArrayBlockingQueue (see the commands I used below). Sometimes
the streams catch up after an arbitrary time (a minute, a few minutes, 30 minutes, an hour).
Sometimes they just time out. In the case of bootstrapping this induces another ugly behaviour:
when the node continues to boot (and join) after a failed bootstrap (e.g. due to a stream
timeout), the node is marked as "UP" and clients start to query it, but then the client throws
exceptions: "Cannot read from a bootstrapping node". Maybe this is worth a separate ticket?
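
For reference, this is roughly how I looked at the stalled streams (standard JDK / Cassandra
tooling; the exact thread-name filter is just an example and may differ per version):

    # stream sessions and per-file progress on the affected node
    nodetool netstats

    # dump the stream threads to see the WAITING state / the blocking queue
    jstack $(pgrep -f CassandraDaemon) | grep -A 15 "STREAM-"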

What I observed this morning, which may also be related:
The system.batches table(s) kept growing and growing. One node had a batchlog of over 70 GB, with
no sign that anything was being processed, that the logs were shrinking, or that anything was
wrong (no down nodes, no "bad" logs). I haven't used any batches from within our application
for weeks, so I dared to truncate the logs with the CQL TRUNCATE command. Maybe an hour later
I saw that repairs hung again. Then I noticed that there were compactions on system.batches
going on on each node in the cluster, also hanging. There were no (debug) logs about that.
See the output of compactionstats: https://gist.github.com/brstgt/6277764f6e34b0531b9bfc5392491280
After having restarted all nodes, compaction of the batches worked again. Is it possible that
repair uses batches internally, so that this blocked the repair?
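
For the record, what I did was essentially this (only because I was sure our application does
not write any batches; please don't copy it blindly):

    -- in cqlsh: drop the accumulated batchlog
    TRUNCATE system.batches;

    # afterwards, on each node, the hanging compactions showed up via
    nodetool compactionstats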

If you have any more hints for me or need any more information I can provide, I am happy to
help :)
Unfortunately I am quite new to C* and obviously I am making every mistake in the book, but I
am willing to learn (I am currently devouring blogs + books), to help, and to get my f*****
cluster up and running stably :D

Thanks so far!


> nodetool repair hangs
> ---------------------
>
>                 Key: CASSANDRA-12280
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12280
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Benjamin Roth
>
> nodetool repair hangs when repairing a keyspace; it does not hang when repairing table/mv by table/mv.
> Command executed (both variants make it hang):
> nodetool repair likes like dislike_by_source_mv like_by_contact_mv match_valid_mv like_out dislike match match_by_contact_mv like_valid_mv like_out_by_source_mv
> OR
> nodetool repair likes
> Logs:
> https://gist.github.com/brstgt/bf8b20fa1942d29ab60926ede7340b75
> Nodetool output:
> https://gist.github.com/brstgt/3aa73662da4b0190630ac1aad6c90a6f
> Schema:
> https://gist.github.com/brstgt/3fd59e0166f86f8065085532e3638097



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
