cassandra-commits mailing list archives

From "Dominik Keil (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-10389) Repair session exception Validation failed
Date Wed, 17 Feb 2016 21:05:18 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-10389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15151180#comment-15151180
] 

Dominik Keil commented on CASSANDRA-10389:
------------------------------------------

I think we're seeing this issue as well. Running Cassandra 2.2.5. Haven't tried restarting
all nodes but will do that now.

We're running incremental repairs (now default, eh?) and while testing this before we put
that into production we already found that repairing a whole keyspace creates a massive
number of open file handles / "anti-compacted" sstables, even though the repair still only
works on one CF at a time. This caused some problems, so we're now running repairs one CF at a
time and on only one node at a time.
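
For anyone wanting to try the same workaround, here is a minimal sketch of how the per-CF repair commands could be generated. It only builds the command lines and does not run them. The keyspace and table names are hypothetical placeholders, and the helper name repair_commands is made up for illustration. It assumes the plain `nodetool repair <keyspace> <table>` form and that each command is run to completion on one node before moving on.

```python
import shlex
from typing import List


def repair_commands(keyspace: str, tables: List[str]) -> List[List[str]]:
    """Build one `nodetool repair` invocation per table (CF).

    Repairing table by table avoids kicking off anti-compaction for the
    whole keyspace in a single repair command.
    """
    return [["nodetool", "repair", keyspace, table] for table in tables]


if __name__ == "__main__":
    # Hypothetical names; substitute your own keyspace and tables.
    for cmd in repair_commands("my_keyspace", ["table_a", "table_b"]):
        # Print the commands so they can be run one after another,
        # on a single node at a time.
        print(" ".join(shlex.quote(part) for part in cmd))
```

Running them strictly one after another (for example with subprocess.run and check=True) keeps only one CF under repair at any moment.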

We did not have this issue in our testing but are seeing it in production now, nevertheless. What's
interesting is that the node on which the repair runs at some point suddenly thrashes its
heap (i.e. full heap usage, 65%-85% GC!!!) while at the same time producing huge amounts of
tiny, concurrent reads, leading to really bad read latency from disk and a lot of I/O wait.

The bad thing is: This (Cassandra) node becomes so unresponsive that it significantly impacts
the performance of the whole cluster (a total of 9 machines, rf 5 / quorum for most reads/writes,
rf 2 / one for less important bulk data). So neither the Java driver nor the other nodes,
when acting as coordinator, manage to just leave this node alone for a while. As soon as I disable
gossip on this node, the rest of the cluster is fine again.

[~slebresne]: I applaud you for your very useful comment.

> Repair session exception Validation failed
> ------------------------------------------
>
>                 Key: CASSANDRA-10389
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10389
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Debian 8, Java 1.8.0_60, Cassandra 2.2.1 (datastax compilation)
>            Reporter: Jędrzej Sieracki
>             Fix For: 2.2.x
>
>
> I'm running a repair on a ring of nodes that was recently extended from 3 to 13 nodes.
The extension was done two days ago, the repair was attempted yesterday.
> {quote}
> [2015-09-22 11:55:55,266] Starting repair command #9, repairing keyspace perspectiv with
repair options (parallelism: parallel, primary range: false, incremental: true, job threads:
1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 517)
> [2015-09-22 11:55:58,043] Repair session 1f7c50c0-6110-11e5-b992-9f13fa8664c8 for range
(-5927186132136652665,-5917344746039874798] failed with error [repair #1f7c50c0-6110-11e5-b992-9f13fa8664c8
on perspectiv/stock_increment_agg, (-5927186132136652665,-5917344746039874798]] Validation
failed in cblade1.XXX/XXX (progress: 0%)
> {quote}
> BTW, I am ignoring the LEAK errors for now, that's outside of the scope of the main issue:
> {quote}
> ERROR [Reference-Reaper:1] 2015-09-22 11:58:27,843 Ref.java:187 - LEAK DETECTED: a reference
(org.apache.cassandra.utils.concurrent.Ref$State@4d25ad8f) to class org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@896826067:/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-73-big
was not released before the reference was garbage collected
> {quote}
> I scrubbed the sstable with failed validation on cblade1 with nodetool scrub perspectiv
stock_increment_agg:
> {quote}
> INFO  [CompactionExecutor:1704] 2015-09-22 12:05:31,615 OutputHandler.java:42 - Scrubbing
BigTableReader(path='/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-83-big-Data.db')
(345466609 bytes)
> INFO  [CompactionExecutor:1703] 2015-09-22 12:05:31,615 OutputHandler.java:42 - Scrubbing
BigTableReader(path='/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-82-big-Data.db')
(60496378 bytes)
> ERROR [Reference-Reaper:1] 2015-09-22 12:05:31,676 Ref.java:187 - LEAK DETECTED: a reference
(org.apache.cassandra.utils.concurrent.Ref$State@4ca8951e) to class org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@114161559:/var/lib/cassandra/data/perspectiv/receipt_agg_total-76abb0625de711e59f6e0b7d98a25b6e/la-48-big
was not released before the reference was garbage collected
> ERROR [Reference-Reaper:1] 2015-09-22 12:05:31,676 Ref.java:187 - LEAK DETECTED: a reference
(org.apache.cassandra.utils.concurrent.Ref$State@eeb6383) to class org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@1612685364:/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-83-big
was not released before the reference was garbage collected
> ERROR [Reference-Reaper:1] 2015-09-22 12:05:31,676 Ref.java:187 - LEAK DETECTED: a reference
(org.apache.cassandra.utils.concurrent.Ref$State@1de90543) to class org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@2058626950:/var/lib/cassandra/data/perspectiv/receipt_agg_total-76abb0625de711e59f6e0b7d98a25b6e/la-49-big
was not released before the reference was garbage collected
> ERROR [Reference-Reaper:1] 2015-09-22 12:05:31,676 Ref.java:187 - LEAK DETECTED: a reference
(org.apache.cassandra.utils.concurrent.Ref$State@15616385) to class org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@1386628428:/var/lib/cassandra/data/perspectiv/receipt_agg_total-76abb0625de711e59f6e0b7d98a25b6e/la-47-big
was not released before the reference was garbage collected
> INFO  [CompactionExecutor:1703] 2015-09-22 12:05:35,098 OutputHandler.java:42 - Scrub
of BigTableReader(path='/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-82-big-Data.db')
complete: 51397 rows in new sstable and 0 empty (tombstoned) rows dropped
> INFO  [CompactionExecutor:1704] 2015-09-22 12:05:47,605 OutputHandler.java:42 - Scrub
of BigTableReader(path='/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-83-big-Data.db')
complete: 292600 rows in new sstable and 0 empty (tombstoned) rows dropped
> {quote}
> Now, after scrubbing, another repair was attempted. It did finish, but with lots of errors
from other nodes:
> {quote}
> [2015-09-22 12:01:18,020] Repair session db476b51-6110-11e5-b992-9f13fa8664c8 for range
(5019296454787813261,5021512586040808168] failed with error [repair #db476b51-6110-11e5-b992-9f13fa8664c8
on perspectiv/stock_increment_agg, (5019296454787813261,5021512586040808168]] Validation failed
in /10.YYY (progress: 91%)
> [2015-09-22 12:01:18,079] Repair session db482ea1-6110-11e5-b992-9f13fa8664c8 for range
(-3660233266780784242,-3638577078894365342] failed with error [repair #db482ea1-6110-11e5-b992-9f13fa8664c8
on perspectiv/stock_increment_agg, (-3660233266780784242,-3638577078894365342]] Validation
failed in /10.XXX (progress: 92%)
> [2015-09-22 12:01:18,276] Repair session db4a0361-6110-11e5-b992-9f13fa8664c8 for range
(9158857758535272856,9167427882441871745] failed with error [repair #db4a0361-6110-11e5-b992-9f13fa8664c8
on perspectiv/stock_increment_agg, (9158857758535272856,9167427882441871745]] Validation failed
in /10.YYY (progress: 95%)
> {quote}
> After scrubbing stock_increment_agg on all nodes, just to be sure, the repair still failed,
this time with the following exception:
> {quote}
> INFO  [Repair#16:50] 2015-09-22 12:08:47,471 RepairJob.java:181 - [repair #ea123bf3-6111-11e5-b992-9f13fa8664c8]
Requesting merkle trees for stock_increment_agg (to [/10.60.77.202, cblade1.XXX/XXX])
> ERROR [RepairJobTask:1] 2015-09-22 12:08:47,471 RepairSession.java:290 - [repair #ea123bf0-6111-11e5-b992-9f13fa8664c8]
Session completed with the following error
> org.apache.cassandra.exceptions.RepairException: [repair #ea123bf0-6111-11e5-b992-9f13fa8664c8
on perspectiv/stock_increment_agg, (355657753119264326,366309649129068298]] Validation failed
in cblade1.
>         at org.apache.cassandra.repair.ValidationTask.treeReceived(ValidationTask.java:64)
~[apache-cassandra-2.2.1.jar:2.2.1]
>         at org.apache.cassandra.repair.RepairSession.validationComplete(RepairSession.java:183)
~[apache-cassandra-2.2.1.jar:2.2.1]
>         at org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:399)
~[apache-cassandra-2.2.1.jar:2.2.1]
>         at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:158)
~[apache-cassandra-2.2.1.jar:2.2.1]
>         at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66)
~[apache-cassandra-2.2.1.jar:2.2.1]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[na:1.8.0_60]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[na:1.8.0_60]
>         at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
> {quote}
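
To see which nodes keep failing validation, the "Repair session ... failed with error" lines quoted above can be tallied per host. The following is a rough sketch, not part of Cassandra or nodetool. It assumes each failure message sits on a single line (unlike the wrapped quotes in this mail), and the regular expression is inferred only from the sample lines in this report. The name failures_by_host is hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Pattern inferred from the failure lines quoted in this issue (an assumption,
# not a documented log format).
FAILURE = re.compile(
    r"Repair session (?P<session>[0-9a-f-]+) for range .*?"
    r"Validation failed in (?P<host>\S+) \(progress: \d+%\)"
)


def failures_by_host(lines: Iterable[str]) -> Counter:
    """Count validation failures per reporting host."""
    counts: Counter = Counter()
    for line in lines:
        match = FAILURE.search(line)
        if match:
            counts[match.group("host")] += 1
    return counts
```

Feeding it the saved repair output, e.g. failures_by_host(open("repair.log")) with a hypothetical file name, gives a quick view of whether the failures cluster on a few hosts.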



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
