cassandra-user mailing list archives

From: Ben Mills <...@bitbrew.com>
Subject: Re: repair failed
Date: Thu, 02 Jan 2020 19:13:52 GMT
Hi Oliver,

I don't have a quick answer (or any answer yet), but we ran into a
similar issue, so I'm wondering about your environment and a few configs.

- Operating system?
- Cloud or on-premise?
- Version of Cassandra?
- Version of Java?
- Compaction strategy?
- Primarily read or primarily write (or a blend of both)?
- How much memory allocated to heap?
- How long do all the repair commands typically take per node?

nodetool repair -full -dcpar will stream data across data centers - is it
possible that the number of nodes, the amount of data, or the number of
keyspaces has grown enough over time to cause streaming issues (and
timeouts)?
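For comparison, one alternative to kicking off -dcpar from a single node is to drive a -full -pr repair across every node in turn, so each run covers a smaller, more debuggable unit of work. A rough sketch only - the host list below is hypothetical; substitute your own addresses:

```shell
#!/bin/sh
# Hypothetical host list -- substitute your own node addresses.
NODES="192.168.13.120 192.168.13.232"

# Print one repair command per node (a dry run); pipe the output to sh
# to actually execute them. -pr limits each run to that node's primary
# token ranges, so looping over every node covers the cluster exactly once.
for node in $NODES; do
    echo "nodetool -h $node repair -full -pr"
done
```

Note that -pr full repairs must be run on every node in every DC to cover all ranges, which trades longer wall-clock time for smaller individual repair sessions.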

You wrote:

Is it problematic if the repair is started only on one node?

Are you asking whether it's ok to run -full repairs one node at a time (on
all nodes)? Or are you saying that you are only repairing one node in each
cluster or DC?
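If streaming timeouts do turn out to be the cause, the knobs usually involved live in cassandra.yaml. A sketch for pre-4.0 versions - the values shown are the 3.11 defaults for illustration, not recommendations, and parameter names vary by Cassandra version:

```
# cassandra.yaml (Cassandra 3.x) -- illustrative values only
# How long an idle streaming socket may sit before it is timed out.
streaming_socket_timeout_in_ms: 86400000
# Period between keep-alive probes on streaming connections (3.10+).
streaming_keep_alive_period_in_secs: 300
```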

Thanks,
Ben




On Sun, Dec 29, 2019 at 3:54 AM gloCalHelp.com <www_8ems_com@sina.com>
wrote:

> To Oliver:
>    Maybe repair should be executed after all the data in the memtables
> has been flushed to disk?
>
>
> Sincerely yours,
> Georgelin
> www_8ems_com@sina.com
> mobile:0086 180 5986 1565
>
>
> ----- Original Message -----
> From: Oliver Herrmann <o.herrmann217@gmail.com>
> To: user@cassandra.apache.org
> Subject: repair failed
> Date: December 28, 2019, 23:15
>
> Hello,
>
> Today our weekly repair job failed for the second time; it had been
> working for many months without a problem. We have multiple Cassandra
> nodes in two data centers.
>
> The repair command is started only on one node with the following
> parameters:
>
> nodetool repair -full -dcpar
>
> Is it problematic if the repair is started only on one node?
>
> The repair fails after one hour with the following error message:
>
>  failed with error Could not create snapshot at /192.168.13.232
> (progress: 0%)
> [2019-12-28 05:00:04,295] Some repair failed
> [2019-12-28 05:00:04,296] Repair command #1 finished in 1 hour 0 minutes 2
> seconds
> error: Repair job has failed with the error message: [2019-12-28
> 05:00:04,295] Some repair failed
> -- StackTrace --
> java.lang.RuntimeException: Repair job has failed with the error message:
> [2019-12-28 05:00:04,295] Some repair failed
>         at
> org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:116)
>         at
> org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
>         at
> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(Unknown
> Source)
>         at
> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(Unknown
> Source)
>         at
> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(Unknown
> Source)
>         at
> com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(Unknown
> Source)
>
> On 192.168.13.232, which is in the second data center, the only relevant
> entries I could find were the following in debug.log:
> DEBUG [COMMIT-LOG-ALLOCATOR] 2019-12-28 04:21:20,143
> AbstractCommitLogSegmentManager.java:109 - No segments in reserve; creating
> a fresh one
> DEBUG [MessagingService-Outgoing-192.168.13.120-Small] 2019-12-28
> 04:31:00,450 OutboundTcpConnection.java:410 - Socket to 192.168.13.120 closed
> DEBUG [MessagingService-Outgoing-192.168.13.120-Small] 2019-12-28
> 04:31:00,450 OutboundTcpConnection.java:349 - Error writing to 192.168.13.120
> java.io.IOException: Connection timed out
>         at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> ~[na:1.8.0_111]
>
> We tried to run repair a few more times but it always failed with the same
> error. After restarting all nodes it was finally successful.
>
> Any idea what could be wrong?
>
> Regards
> Oliver
>
