cassandra-user mailing list archives

From "ZAIDI, ASAD A" <>
Subject RE: Repair failed and crash the node, how to bring it back?
Date Thu, 01 Aug 2019 13:39:38 GMT
I don’t think anyone can predict with certainty whether the instance will crash again, but there
is a good chance it will unless you take remedial action.
If you are not doing subrange repair, a lot of Merkle tree data can potentially be scanned and
streamed, which takes a toll on memory. Combined with all the other running operations, that
can easily exhaust the available memory.

You can do a few things. As a short-term measure, increase the allotted heap size and run
subrange repair, either with a script or with the Reaper tool.
You may also want to check the partition sizes of your tables (nodetool tablestats) to see
whether they’re bloated.
Also check whether table scans are hitting lots of tombstones, which in turn adds to heap consumption.
My $0.02 for the moment.
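
To illustrate the subrange idea above, here is a minimal sketch (not the script referred to in the message) that cuts one token range into equal slices and builds one repair command per slice. It assumes the Murmur3 partitioner's signed 64-bit token space and nodetool's -st/-et (start/end token) options. The helper names split_range and repair_commands are made up for this example, and the token bounds you pass in are placeholders for ranges the node actually owns.

```python
from typing import List, Tuple

# Murmur3Partitioner token bounds (assumption: the cluster uses Murmur3).
MIN_TOKEN = -(2 ** 63)
MAX_TOKEN = 2 ** 63 - 1


def split_range(start: int, end: int, parts: int) -> List[Tuple[int, int]]:
    """Split the token range (start, end] into up to `parts` contiguous slices."""
    if parts < 1:
        raise ValueError("parts must be >= 1")
    if end <= start:
        # Wrap-around ranges are deliberately not handled in this sketch.
        raise ValueError("end must be greater than start")
    width = end - start
    bounds = [start + (width * i) // parts for i in range(parts)] + [end]
    # Drop empty slices, which appear when parts exceeds the range width.
    return [(lo, hi) for lo, hi in zip(bounds, bounds[1:]) if lo < hi]


def repair_commands(keyspace: str, start: int, end: int, parts: int) -> List[str]:
    """Build one full-repair command per slice; run them one at a time."""
    return [
        f"nodetool repair -full -st {lo} -et {hi} {keyspace}"
        for lo, hi in split_range(start, end, parts)
    ]
```

For example, repair_commands("keyspace_masked", start, end, 16) returns 16 commands that each validate and stream a much smaller slice than one repair over the whole range, which is what keeps the per-session memory footprint down.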

From: Martin Xue []
Sent: Wednesday, July 31, 2019 5:05 PM
Subject: Re: Repair failed and crash the node, how to bring it back?

Hi Alex,

Thanks for your reply. Disk usage was around 80%. The crash happened during a full
primary-range repair on a 1 TB keyspace.

Would it crash again?


On Thu., 1 Aug. 2019, 12:04 am Alexander Dejanovski wrote:
It looks like you have a corrupted hint file.
Did the node run out of disk space while repair was running?

You might want to move the hint files out of their current directory and try to restart the node.
Since you'll have lost mutations then, you'll need... to run repair ¯\_(ツ)_/¯

Alexander Dejanovski

Apache Cassandra Consulting<>
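
The suggestion above (moving the hint files aside before restarting) could be sketched as follows. This is only an illustration under stated assumptions: the node is stopped, hint files end in .hints with companion .crc32 checksum files, and the directory paths are whatever hints_directory in cassandra.yaml points to plus a backup location of your choosing. The function name move_hint_files is made up for this example.

```python
import shutil
from pathlib import Path
from typing import List


def move_hint_files(hints_dir: Path, backup_dir: Path) -> List[Path]:
    """Move hint files and their checksum files out of hints_dir.

    Files are moved rather than deleted so they can still be inspected later.
    Returns the new locations of the moved files.
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    # Snapshot the listing first so the loop does not race with its own moves.
    for path in sorted(hints_dir.iterdir()):
        if path.is_file() and path.suffix in {".hints", ".crc32"}:
            target = backup_dir / path.name
            shutil.move(str(path), str(target))
            moved.append(target)
    return moved
```

A call would look like move_hint_files(Path("/var/lib/cassandra/hints"), Path("/var/tmp/hints_backup")), where both paths are placeholders. The hints held in those files will not be replayed, which is why a repair is needed afterwards.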

On Wed, Jul 31, 2019 at 3:51 PM Martin Xue wrote:

I am running repair in production, starting with one of the 6 nodes in the cluster (3 nodes in
each of two DCs). The Cassandra version is 3.0.14.

I ran "repair -pr --full keyspace" on node 1 (1 TB of data); it ran for two days and then crashed.

The error shows:
3202]] finished (progress: 3%)
Exception occurred during clean-up. java.lang.reflect.UndeclaredThrowableException
Cassandra has shutdown.
error: [2019-07-31 20:19:20,797] JMX connection closed. You should check server log for repair
status of keyspace keyspace_masked (Subsequent keyspaces are not going to be repaired).
-- StackTrace -- [2019-07-31 20:19:20,797] JMX connection closed. You should check server
log for repair status of keyspace keyspace_masked (Subsequent keyspaces are not going to be repaired).
        at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(
        at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.fetchOneNotif(
        at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.fetchNotifs(
        at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(
        at com.sun.jmx.remote.internal.ClientNotifForwarder$
        at com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$

system.log shows
INFO  [Service Thread] 2019-07-31 20:19:08,579 - G1 Young Generation GC in 2915ms.  G1 Eden Space: 914358272 -> 0; G1 Old Gen: 19043999248 -> 20219035248;
INFO  [Service Thread] 2019-07-31 20:19:08,579 - Pool Name                    Active   Pending      Completed   Blocked  All Time Blocked
INFO  [Service Thread] 2019-07-31 20:19:08,584 - MutationStage                    19        15     9578177305         0                 0
INFO  [Service Thread] 2019-07-31 20:19:08,585 - ViewMutationStage                 0         0              0         0                 0
INFO  [Service Thread] 2019-07-31 20:19:08,585 - ReadStage                        10         0      219357504         0                 0
INFO  [Service Thread] 2019-07-31 20:19:08,585 - RequestResponseStage              1         0      625174550         0                 0
INFO  [Service Thread] 2019-07-31 20:19:08,585 - ReadRepairStage                   0         0        2544772         0                 0
INFO  [Service Thread] 2019-07-31 20:19:08,585 - CounterMutationStage              0         0              0         0                 0
INFO  [Service Thread] 2019-07-31 20:19:08,585 - MiscStage                         0         0              0         0                 0
INFO  [Service Thread] 2019-07-31 20:19:08,586 - CompactionExecutor                1         1        9515493         0                 0

When I restart Cassandra, it still fails.
The error in system.log now shows:

INFO  [main] 2019-07-31 21:35:02,044 - Cassandra version: 3.0.14
INFO  [main] 2019-07-31 21:35:02,044 - Thrift API version: 20.1.0
INFO  [main] 2019-07-31 21:35:02,044 - CQL supported versions: 3.4.0 (default: 3.4.0)
ERROR [main] 2019-07-31 21:35:02,075 - Exception encountered during
        at org.apache.cassandra.hints.HintsDescriptor.readFromFile(
        at$3$1.accept( ~[na:1.8.0_171]
        at$2$1.accept( ~[na:1.8.0_171]
        at java.util.Iterator.forEachRemaining( ~[na:1.8.0_171]
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(
        at ~[na:1.8.0_171]
        at ~[na:1.8.0_171]
        at$ReduceOp.evaluateSequential( ~[na:1.8.0_171]
        at ~[na:1.8.0_171]
        at ~[na:1.8.0_171]
        at org.apache.cassandra.hints.HintsCatalog.load( ~[apache-cassandra-3.0.14.jar:3.0.14]
        at org.apache.cassandra.hints.HintsService.<init>( ~[apache-cassandra-3.0.14.jar:3.0.14]
        at org.apache.cassandra.hints.HintsService.<clinit>( ~[apache-cassandra-3.0.14.jar:3.0.14]
        at org.apache.cassandra.service.StorageProxy.<clinit>(
        at java.lang.Class.forName0(Native Method) ~[na:1.8.0_171]
        at java.lang.Class.forName( ~[na:1.8.0_171]
        at org.apache.cassandra.service.StorageService.initServer(
        at org.apache.cassandra.service.StorageService.initServer(
        at org.apache.cassandra.service.CassandraDaemon.setup( [apache-cassandra-3.0.14.jar:3.0.14]
        at org.apache.cassandra.service.CassandraDaemon.activate(
        at org.apache.cassandra.service.CassandraDaemon.main( [apache-cassandra-3.0.14.jar:3.0.14]
Caused by: null
        at ~[na:1.8.0_171]
        at org.apache.cassandra.hints.HintsDescriptor.deserialize(
        at org.apache.cassandra.hints.HintsDescriptor.readFromFile(
        ... 20 common frames omitted

Can anyone help me bring the node back up?

Also, anti-compaction (after repair) is running on other nodes. Should I stop it as well,
and if so, how (nodetool stop compaction?)?

Any suggestions will be much appreciated.

