cassandra-commits mailing list archives

From "Heiko Sommer (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-10389) Repair session exception Validation failed
Date Thu, 16 Jun 2016 09:10:05 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-10389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333412#comment-15333412 ]

Heiko Sommer edited comment on CASSANDRA-10389 at 6/16/16 9:09 AM:
-------------------------------------------------------------------

I'm seeing the same problem with Cassandra 2.2.5 on a cluster of 6 nodes with RF=2.
As a workaround, I have to restart all nodes before running a repair.

I am certain that I do not start multiple repairs simultaneously. Here is what happened the last time
I tried it: the previous incremental repair ("{{nodetool repair --partitioner-range -- mykeyspace}}"),
started on a single node after a rolling cluster restart, finished cleanly with
the expected number of "Session completed successfully" log entries. No repair tasks
or anticompaction tasks were still running, and the cluster was stable. I then restarted C* on 4 nodes
but left it running on the other 2. On one of the restarted nodes I ran an incremental repair again, this
time also with the "{{--sequential}}" option.
On the repairing node I get failure logs such as
{noformat}
java.lang.RuntimeException: Could not create snapshot at /10.195.62.171
        at org.apache.cassandra.repair.SnapshotTask$SnapshotCallback.onFailure(SnapshotTask.java:79)
~[apache-cassandra-2.2.5.jar:2.2.5]
ERROR [Repair#1:16] 2016-06-16 07:10:29,239 CassandraDaemon.java:185 - Exception in thread
Thread[Repair#1:16,5,RMI Runtime]
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException:
Could not create snapshot at /10.195.62.171
        at com.google.common.util.concurrent.Futures.wrapAndThrowUnchecked(Futures.java:1387)
~[guava-16.0.jar:na]
{noformat}
while on the failing target nodes (those that were not restarted before the repair) I get
logs such as
{noformat}
ERROR [AntiEntropyStage:1] 2016-06-16 07:10:29,237 RepairMessageVerbHandler.java:108 - Cannot
start multiple repair sessions over the same sstables
{noformat}

Before that, I also tried a full repair and got the impression that the problem is the same
for full and incremental repairs.
Since I can reproduce the issue, I would be glad to provide more logs or run some experiments
if that would help resolve it.
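
To compare nodes, the two failure messages quoted in this comment can be counted in each node's log. The following is only a minimal Python sketch: the function name {{count_repair_errors}} and the labels in {{SIGNATURES}} are illustrative assumptions and not part of Cassandra, while the message fragments themselves are copied from the logs above.

```python
import re

# Message fragments copied verbatim from the logs quoted in this comment.
# The dictionary keys are illustrative labels, not Cassandra identifiers.
SIGNATURES = {
    "snapshot_failed": "Could not create snapshot at",
    "overlapping_sessions": "Cannot start multiple repair sessions over the same sstables",
}


def count_repair_errors(log_text):
    """Count occurrences of each known failure message in a block of log text."""
    # Long log lines may be wrapped (as in this mail), so collapse every run
    # of whitespace, including newlines, into a single space before matching.
    flat = re.sub(r"\s+", " ", log_text)
    return {label: flat.count(fragment) for label, fragment in SIGNATURES.items()}


if __name__ == "__main__":
    sample = (
        "ERROR [AntiEntropyStage:1] 2016-06-16 07:10:29,237 "
        "RepairMessageVerbHandler.java:108 - Cannot\n"
        "start multiple repair sessions over the same sstables\n"
    )
    print(count_repair_errors(sample))
```

Running it against the log of each node should show the snapshot failure on the repairing node and the "multiple repair sessions" message on the nodes that were not restarted, if the behaviour matches what is described here.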


> Repair session exception Validation failed
> ------------------------------------------
>
>                 Key: CASSANDRA-10389
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10389
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Debian 8, Java 1.8.0_60, Cassandra 2.2.1 (datastax compilation)
>            Reporter: Jędrzej Sieracki
>             Fix For: 2.2.x
>
>
> I'm running a repair on a ring of nodes that was recently extended from 3 to 13 nodes.
The extension was done two days ago; the repair was attempted yesterday.
> {quote}
> [2015-09-22 11:55:55,266] Starting repair command #9, repairing keyspace perspectiv with
repair options (parallelism: parallel, primary range: false, incremental: true, job threads:
1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 517)
> [2015-09-22 11:55:58,043] Repair session 1f7c50c0-6110-11e5-b992-9f13fa8664c8 for range
(-5927186132136652665,-5917344746039874798] failed with error [repair #1f7c50c0-6110-11e5-b992-9f13fa8664c8
on perspectiv/stock_increment_agg, (-5927186132136652665,-5917344746039874798]] Validation
failed in cblade1.XXX/XXX (progress: 0%)
> {quote}
> BTW, I am ignoring the LEAK errors for now, as they are outside the scope of the main issue:
> {quote}
> ERROR [Reference-Reaper:1] 2015-09-22 11:58:27,843 Ref.java:187 - LEAK DETECTED: a reference
(org.apache.cassandra.utils.concurrent.Ref$State@4d25ad8f) to class org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@896826067:/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-73-big
was not released before the reference was garbage collected
> {quote}
> I scrubbed the sstable with failed validation on cblade1 with nodetool scrub perspectiv
stock_increment_agg:
> {quote}
> INFO  [CompactionExecutor:1704] 2015-09-22 12:05:31,615 OutputHandler.java:42 - Scrubbing
BigTableReader(path='/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-83-big-Data.db')
(345466609 bytes)
> INFO  [CompactionExecutor:1703] 2015-09-22 12:05:31,615 OutputHandler.java:42 - Scrubbing
BigTableReader(path='/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-82-big-Data.db')
(60496378 bytes)
> ERROR [Reference-Reaper:1] 2015-09-22 12:05:31,676 Ref.java:187 - LEAK DETECTED: a reference
(org.apache.cassandra.utils.concurrent.Ref$State@4ca8951e) to class org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@114161559:/var/lib/cassandra/data/perspectiv/receipt_agg_total-76abb0625de711e59f6e0b7d98a25b6e/la-48-big
was not released before the reference was garbage collected
> ERROR [Reference-Reaper:1] 2015-09-22 12:05:31,676 Ref.java:187 - LEAK DETECTED: a reference
(org.apache.cassandra.utils.concurrent.Ref$State@eeb6383) to class org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@1612685364:/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-83-big
was not released before the reference was garbage collected
> ERROR [Reference-Reaper:1] 2015-09-22 12:05:31,676 Ref.java:187 - LEAK DETECTED: a reference
(org.apache.cassandra.utils.concurrent.Ref$State@1de90543) to class org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@2058626950:/var/lib/cassandra/data/perspectiv/receipt_agg_total-76abb0625de711e59f6e0b7d98a25b6e/la-49-big
was not released before the reference was garbage collected
> ERROR [Reference-Reaper:1] 2015-09-22 12:05:31,676 Ref.java:187 - LEAK DETECTED: a reference
(org.apache.cassandra.utils.concurrent.Ref$State@15616385) to class org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@1386628428:/var/lib/cassandra/data/perspectiv/receipt_agg_total-76abb0625de711e59f6e0b7d98a25b6e/la-47-big
was not released before the reference was garbage collected
> INFO  [CompactionExecutor:1703] 2015-09-22 12:05:35,098 OutputHandler.java:42 - Scrub
of BigTableReader(path='/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-82-big-Data.db')
complete: 51397 rows in new sstable and 0 empty (tombstoned) rows dropped
> INFO  [CompactionExecutor:1704] 2015-09-22 12:05:47,605 OutputHandler.java:42 - Scrub
of BigTableReader(path='/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-83-big-Data.db')
complete: 292600 rows in new sstable and 0 empty (tombstoned) rows dropped
> {quote}
> After scrubbing, another repair was attempted. It did finish, but with many errors
from other nodes:
> {quote}
> [2015-09-22 12:01:18,020] Repair session db476b51-6110-11e5-b992-9f13fa8664c8 for range
(5019296454787813261,5021512586040808168] failed with error [repair #db476b51-6110-11e5-b992-9f13fa8664c8
on perspectiv/stock_increment_agg, (5019296454787813261,5021512586040808168]] Validation failed
in /10.YYY (progress: 91%)
> [2015-09-22 12:01:18,079] Repair session db482ea1-6110-11e5-b992-9f13fa8664c8 for range
(-3660233266780784242,-3638577078894365342] failed with error [repair #db482ea1-6110-11e5-b992-9f13fa8664c8
on perspectiv/stock_increment_agg, (-3660233266780784242,-3638577078894365342]] Validation
failed in /10.XXX (progress: 92%)
> [2015-09-22 12:01:18,276] Repair session db4a0361-6110-11e5-b992-9f13fa8664c8 for range
(9158857758535272856,9167427882441871745] failed with error [repair #db4a0361-6110-11e5-b992-9f13fa8664c8
on perspectiv/stock_increment_agg, (9158857758535272856,9167427882441871745]] Validation failed
in /10.YYY (progress: 95%)
> {quote}
> After scrubbing stock_increment_agg on all nodes, just to be sure, the repair still failed,
this time with the following exception:
> {quote}
> INFO  [Repair#16:50] 2015-09-22 12:08:47,471 RepairJob.java:181 - [repair #ea123bf3-6111-11e5-b992-9f13fa8664c8]
Requesting merkle trees for stock_increment_agg (to [/10.60.77.202, cblade1.XXX/XXX])
> ERROR [RepairJobTask:1] 2015-09-22 12:08:47,471 RepairSession.java:290 - [repair #ea123bf0-6111-11e5-b992-9f13fa8664c8]
Session completed with the following error
> org.apache.cassandra.exceptions.RepairException: [repair #ea123bf0-6111-11e5-b992-9f13fa8664c8
on perspectiv/stock_increment_agg, (355657753119264326,366309649129068298]] Validation failed
in cblade1.
>         at org.apache.cassandra.repair.ValidationTask.treeReceived(ValidationTask.java:64)
~[apache-cassandra-2.2.1.jar:2.2.1]
>         at org.apache.cassandra.repair.RepairSession.validationComplete(RepairSession.java:183)
~[apache-cassandra-2.2.1.jar:2.2.1]
>         at org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:399)
~[apache-cassandra-2.2.1.jar:2.2.1]
>         at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:158)
~[apache-cassandra-2.2.1.jar:2.2.1]
>         at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66)
~[apache-cassandra-2.2.1.jar:2.2.1]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[na:1.8.0_60]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[na:1.8.0_60]
>         at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
