cassandra-commits mailing list archives

From "Benjamin Roth (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-12280) nodetool repair hangs
Date Mon, 15 Aug 2016 16:15:22 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-12280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15421200#comment-15421200 ]

Benjamin Roth commented on CASSANDRA-12280:
-------------------------------------------

Some traces of hanging repairs (or, more precisely, hanging streams):

A repair that hung and ended in a broken pipe:
- Trace, netstats, compactionstats: http://pastebin.com/sFhe1NpZ

Trace of another run of the same range repair, parallel; it hung for about 23 minutes but finished successfully:
- Trace: https://cl.ly/1s3b2F3o3900

Trace of another run of the same range, sequential, run while the network was (artificially,
using iperf) completely saturated:
- Network graphs: https://cl.ly/0A030X2m463z / https://cl.ly/2F2E412i2Q07
- Trace: https://cl.ly/2b3y1C1O243k
It completed much faster even though it was run sequentially AND the network was fully saturated;
it just had shorter streaming lags.

These are only a few examples.

Is it possible that there are blocking / deadlock scenarios in streaming?
I don't claim that our network stack is 100% perfectly tuned, but it is very unlikely
that these pauses are caused by the network layer or by overloaded disks / CPUs. I applied most
of the suggested sysctl parameters from Al's Tuning guide (https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html).
I am also able to easily push 700-900 Mbit/s between the affected nodes in addition to C*
running in normal operation.
To rule out a filesystem issue, I copied all SSTables for that CF (around 13GB) over the network
to the host that is also part of the repair job. This worked as expected, with a throughput of
90-100MB/s.
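
As a rough sanity check on the figures above, the following sketch converts the iperf range to MB/s and estimates how long the 13GB copy should take. It assumes decimal units (1 GB = 1000 MB) and 8 bits per byte; the function names are made up for illustration.

```python
def mbit_to_mbyte_per_s(mbit_per_s):
    """Convert a rate in Mbit/s to MB/s (assumes 8 bits per byte)."""
    return mbit_per_s / 8.0


def copy_seconds(size_gb, mbyte_per_s):
    """Estimated transfer time in seconds (assumes 1 GB = 1000 MB)."""
    return size_gb * 1000.0 / mbyte_per_s


# iperf range quoted above: 700-900 Mbit/s
low, high = mbit_to_mbyte_per_s(700), mbit_to_mbyte_per_s(900)
print("iperf: %.1f-%.1f MB/s" % (low, high))

# SSTable copy quoted above: about 13GB at 90-100MB/s
fast, slow = copy_seconds(13, 100), copy_seconds(13, 90)
print("13GB copy: %.0f-%.0f s" % (fast, slow))
```

Under these assumptions the iperf range and the file copy throughput are in the same ballpark, and the whole CF should move in roughly two to two and a half minutes, far less than the pauses seen in the traces.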

I am aware that streaming is much more than transferring some files. As far as I know so far,
C* uses the normal data flow during a stream (memtable > sstable > compaction
...), but a stream that hangs for many minutes without an obvious reason is really obscure.
I also checked the CPU / allocation stats of the affected nodes with sjk-plus. Here too there is
no obvious activity such as StreamReceiverTask, Compaction, ...; only normal operation activity.
It behaves just as if there were a stale lock lingering around somewhere.

Is there anything more I can do?

> nodetool repair hangs
> ---------------------
>
>                 Key: CASSANDRA-12280
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12280
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Benjamin Roth
>
> nodetool repair hangs when repairing a keyspace, does not hang when repairing table/mv by table/mv.
> Command executed (both variants make it hang):
> nodetool repair likes like dislike_by_source_mv like_by_contact_mv match_valid_mv like_out dislike match match_by_contact_mv like_valid_mv like_out_by_source_mv
> OR
> nodetool repair likes
> Logs:
> https://gist.github.com/brstgt/bf8b20fa1942d29ab60926ede7340b75
> Nodetool output:
> https://gist.github.com/brstgt/3aa73662da4b0190630ac1aad6c90a6f
> Schema:
> https://gist.github.com/brstgt/3fd59e0166f86f8065085532e3638097



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
