cassandra-user mailing list archives

From Robert Sicoie <robert.sic...@gmail.com>
Subject Re: How to get rid of "Cannot start multiple repair sessions over the same sstables" exception
Date Thu, 29 Sep 2016 12:28:47 GMT
Thanks Alexander,

After a rolling restart, the blocked repair job stopped and I was able to run
repair again.

Regards,
Robert

Robert Sicoie

On Wed, Sep 28, 2016 at 6:46 PM, Alexander Dejanovski <
alex@thelastpickle.com> wrote:

> Robert,
>
> You can restart them in any order, that doesn't make a difference afaik.
>
> Cheers
>
> On Wed, Sep 28, 2016 at 17:10, Robert Sicoie <robert.sicoie@gmail.com>
> wrote:
>
>> Thanks Alexander,
>>
>> Yes, with tpstats I can see the hanging active repair(s) (output
>> attached). On one node there are 31 pending repairs; on the others there
>> are fewer (minimum 12). Is there any recommendation for the restart order?
>> The one with fewer pending repairs first, perhaps?
>>
>> Thanks,
>> Robert
>>
>> Robert Sicoie
>>
>> On Wed, Sep 28, 2016 at 5:35 PM, Alexander Dejanovski <
>> alex@thelastpickle.com> wrote:
>>
>>> Anticompactions will show up in nodetool compactionstats:
>>> https://issues.apache.org/jira/browse/CASSANDRA-9098
>>>
>>> Did you check nodetool tpstats to make sure you don't have any running
>>> repair session?
>>> Just to make sure (and if you can actually do it), do a rolling restart of
>>> the cluster and try again. Repair sessions can get sticky sometimes.
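
The tpstats check suggested above could be scripted roughly as follows. This
is only a sketch: the pool-name prefixes in REPAIR_POOLS and the column order
(name, active, pending) are assumptions about what nodetool tpstats prints,
so compare them with the output of your own version before relying on it.

```python
import subprocess

# Assumed pool-name prefixes for repair-related thread pools; adjust as needed.
REPAIR_POOLS = ("AntiEntropyStage", "ValidationExecutor", "Repair")


def busy_repair_pools(tpstats_text, prefixes=REPAIR_POOLS):
    """Return {pool: (active, pending)} for repair pools with work in flight."""
    busy = {}
    for line in tpstats_text.splitlines():
        fields = line.split()
        # Assumed row layout: pool name, active, pending, ...
        if len(fields) < 3 or not fields[0].startswith(tuple(prefixes)):
            continue
        if not (fields[1].isdigit() and fields[2].isdigit()):
            continue  # header or unrelated table row
        active, pending = int(fields[1]), int(fields[2])
        if active or pending:
            busy[fields[0]] = (active, pending)
    return busy


def tpstats(host):
    """Run `nodetool -h <host> tpstats` and return its stdout."""
    return subprocess.run(["nodetool", "-h", host, "tpstats"],
                          check=True, capture_output=True, text=True).stdout
```

Calling busy_repair_pools(tpstats(host)) for each node and getting an empty
dict everywhere would suggest that no repair work is still queued.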
>>>
>>> On Wed, Sep 28, 2016 at 4:23 PM Robert Sicoie <robert.sicoie@gmail.com>
>>> wrote:
>>>
>>>> I am using nodetool compactionstats to check for pending compactions
>>>> and it shows me 0 pending on all nodes, seconds before running nodetool
>>>> repair.
>>>> I am also monitoring PendingCompactions via JMX.
>>>>
>>>> Is there any other way I can find out whether an anticompaction is
>>>> running on any node?
>>>>
>>>> Thanks a lot,
>>>> Robert
>>>>
>>>> Robert Sicoie
>>>>
>>>> On Wed, Sep 28, 2016 at 4:44 PM, Alexander Dejanovski <
>>>> alex@thelastpickle.com> wrote:
>>>>
>>>>> Robert,
>>>>>
>>>>> you need to make sure you have no repair session currently running on
>>>>> your cluster, and no anticompaction.
>>>>> I'd recommend doing a rolling restart in order to stop all running
>>>>> repair for sure, then start the process again, node by node, checking
>>>>> that no anticompaction is running before moving from one node to the
>>>>> other.
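
The node-by-node procedure described here could be automated along these
lines. It is a sketch under assumptions: it treats any compactionstats line
containing the word "anticompaction" as a running anticompaction, and the
helper names, the polling interval and the injectable run/sleep parameters
are illustrative choices, not part of nodetool. The -pr switch is left out
on purpose, following the advice in this message.

```python
import subprocess
import time


def run_nodetool(host, *args):
    """Run `nodetool -h <host> <args...>` and return its stdout."""
    return subprocess.run(["nodetool", "-h", host, *args],
                          check=True, capture_output=True, text=True).stdout


def anticompacting(host, run=run_nodetool):
    # Assumption: anticompaction rows mention "anticompaction" in their type.
    return "anticompaction" in run(host, "compactionstats").lower()


def repair_node_by_node(hosts, keyspace, run=run_nodetool,
                        poll_seconds=60, sleep=time.sleep):
    """Repair each host in turn, waiting while any node is anticompacting."""
    for host in hosts:
        # Wait out anticompaction left over from the previous node's repair.
        while any(anticompacting(h, run) for h in hosts):
            sleep(poll_seconds)
        run(host, "repair", keyspace)  # no -pr switch
```

Passing run and sleep as parameters keeps the loop testable without a live
cluster; in normal use the defaults shell out to nodetool.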
>>>>>
>>>>> Please do not use the -pr switch as it is both useless (token ranges
>>>>> are repaired only once with incremental repair, whatever the
>>>>> replication factor) and harmful, as not all anticompactions will be
>>>>> executed (you'll still have sstables marked as unrepaired even if the
>>>>> process has run entirely with no error).
>>>>>
>>>>> Let us know how that goes.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> On Wed, Sep 28, 2016 at 2:57 PM Robert Sicoie <robert.sicoie@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks Alexander,
>>>>>>
>>>>>> I have now started running the repair with the -pr argument and with
>>>>>> keyspace and table arguments.
>>>>>> Still, I got the "ERROR [RepairJobTask:1] 2016-09-28 11:34:38,288
>>>>>> RepairRunnable.java:246 - Repair session 89af4d10-856f-11e6-b28f-df99132d7979
>>>>>> for range [(8323429577695061526,8326640819362122791],
>>>>>> ..., (4212695343340915405,4229348077081465596]]] Validation failed
>>>>>> in /10.45.113.88"
>>>>>>
>>>>>> for one of the tables. 10.45.113.88 is the IP of the machine I am
>>>>>> running nodetool on.
>>>>>> I'm wondering if this is normal...
>>>>>>
>>>>>> Thanks,
>>>>>> Robert
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Robert Sicoie
>>>>>>
>>>>>> On Wed, Sep 28, 2016 at 11:53 AM, Alexander Dejanovski <
>>>>>> alex@thelastpickle.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> nodetool scrub won't help here, as what you're experiencing is most
>>>>>>> likely that one SSTable is going through anticompaction, and then
>>>>>>> another node is asking for a Merkle tree that involves it.
>>>>>>> For understandable reasons, an SSTable cannot be anticompacted and
>>>>>>> validation compacted at the same time.
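
As an illustration of the conflict described above, here is a toy model in
which each SSTable can be claimed by only one repair session at a time. It is
not Cassandra's implementation; the class name, method names and SSTable
names are hypothetical and only mirror the behaviour behind the error
message.

```python
class SSTableRepairRegistry:
    """Toy model: an SSTable may be claimed by at most one repair session."""

    def __init__(self):
        self._owner = {}  # sstable name -> session id

    def claim(self, session_id, sstables):
        # An SSTable held by a different session blocks the whole claim.
        clash = sorted(s for s in sstables
                       if self._owner.get(s, session_id) != session_id)
        if clash:
            raise RuntimeError(
                "Cannot start multiple repair sessions over the same sstables: "
                + ", ".join(clash))
        for s in sstables:
            self._owner[s] = session_id

    def release(self, session_id):
        # Called when a session (including its anticompaction) has finished.
        self._owner = {s: o for s, o in self._owner.items() if o != session_id}
```

In this model a second session that overlaps an unfinished one fails
immediately, and succeeds once the first session has released its SSTables,
which is why waiting for anticompaction to end (or restarting to clear a
stuck session) resolves the error.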
>>>>>>>
>>>>>>> The solution here is to adjust the repair pressure on your cluster
>>>>>>> so that anticompaction can end before you run repair on another
>>>>>>> node. You may have a lot of anticompaction to do if you had high
>>>>>>> volumes of unrepaired data, which can take a long time depending on
>>>>>>> several factors.
>>>>>>>
>>>>>>> You can tune your repair process to make sure no anticompaction is
>>>>>>> running before launching a new session on another node, or you can
>>>>>>> try my Reaper fork that handles incremental repair:
>>>>>>> https://github.com/adejanovski/cassandra-reaper/tree/inc-repair-support-with-ui
>>>>>>> I may have to add a few checks in order to avoid all collisions
>>>>>>> between anticompactions and new sessions, but it should be helpful
>>>>>>> if you struggle with incremental repair.
>>>>>>>
>>>>>>> In any case, check if your nodes are still anticompacting before
>>>>>>> trying to run a new repair session on a node.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Sep 28, 2016 at 10:31 AM Robert Sicoie <
>>>>>>> robert.sicoie@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi guys,
>>>>>>>>
>>>>>>>> I have a cluster of 5 nodes, Cassandra 3.0.5.
>>>>>>>> I have been running nodetool repair over the last few days, one node
>>>>>>>> at a time, when I first encountered this exception:
>>>>>>>>
>>>>>>>> ERROR [ValidationExecutor:11] 2016-09-27 16:12:20,409 CassandraDaemon.java:195 - Exception in thread Thread[ValidationExecutor:11,1,main]
>>>>>>>> java.lang.RuntimeException: Cannot start multiple repair sessions over the same sstables
>>>>>>>>     at org.apache.cassandra.db.compaction.CompactionManager.getSSTablesToValidate(CompactionManager.java:1194) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>>>     at org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(CompactionManager.java:1084) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>>>     at org.apache.cassandra.db.compaction.CompactionManager.access$700(CompactionManager.java:80) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>>>     at org.apache.cassandra.db.compaction.CompactionManager$10.call(CompactionManager.java:714) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_60]
>>>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_60]
>>>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_60]
>>>>>>>>     at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
>>>>>>>>
>>>>>>>> On some of the other boxes I see this:
>>>>>>>>
>>>>>>>>
>>>>>>>> Caused by: org.apache.cassandra.exceptions.RepairException: [repair #9dd21ab0-83f4-11e6-b28f-df99132d7979 on notes/operator_source_mv, [(-7505573573695693981,-7495786486761919991],
>>>>>>>> ....
>>>>>>>> (-8483612809930827919,-8480482504800860871]]] Validation failed in /10.45.113.67
>>>>>>>>     at org.apache.cassandra.repair.ValidationTask.treesReceived(ValidationTask.java:68) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>>>     at org.apache.cassandra.repair.RepairSession.validationComplete(RepairSession.java:183) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>>>     at org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:408) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>>>     at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:168) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>>>     at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_60]
>>>>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_60]
>>>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_60]
>>>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_60]
>>>>>>>>     at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
>>>>>>>> ERROR [RepairJobTask:3] 2016-09-26 16:39:33,096 CassandraDaemon.java:195 - Exception in thread Thread[RepairJobTask:3,5,RMI Runtime]
>>>>>>>> java.lang.AssertionError: java.lang.InterruptedException
>>>>>>>>     at org.apache.cassandra.net.OutboundTcpConnection.enqueue(OutboundTcpConnection.java:172) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>>>     at org.apache.cassandra.net.MessagingService.sendOneWay(MessagingService.java:761) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>>>     at org.apache.cassandra.net.MessagingService.sendOneWay(MessagingService.java:729) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>>>     at org.apache.cassandra.repair.ValidationTask.run(ValidationTask.java:56) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_60]
>>>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_60]
>>>>>>>>     at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_60]
>>>>>>>> Caused by: java.lang.InterruptedException: null
>>>>>>>>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220) ~[na:1.8.0_60]
>>>>>>>>     at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335) ~[na:1.8.0_60]
>>>>>>>>     at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339) ~[na:1.8.0_60]
>>>>>>>>     at org.apache.cassandra.net.OutboundTcpConnection.enqueue(OutboundTcpConnection.java:168) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>>>     ... 6 common frames omitted
>>>>>>>>
>>>>>>>>
>>>>>>>> Now if I run nodetool repair I get the
>>>>>>>>
>>>>>>>> java.lang.RuntimeException: Cannot start multiple repair sessions over the same sstables
>>>>>>>>
>>>>>>>> exception.
>>>>>>>> What do you suggest? Would nodetool scrub or sstablescrub help in
>>>>>>>> this case, or would it just make it worse?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Robert
>>>>>>>>
>>>>>>> --
>>>>>>> -----------------
>>>>>>> Alexander Dejanovski
>>>>>>> France
>>>>>>> @alexanderdeja
>>>>>>>
>>>>>>> Consultant
>>>>>>> Apache Cassandra Consulting
>>>>>>> http://www.thelastpickle.com
>>>>>>>
>>>>>>
