Hi.

We are facing a strange problem: our Cassandra nodes lose sight of each other on only one day of the week, Friday, at exactly 14:50, and this has been happening for several months.

At that time, each node periodically reports that other nodes are dead.

Meanwhile, the nodes themselves are working fine.

This continues for about an hour, after which the cluster stabilizes.

CPU load is low.

Here are several snippets from the log file of one node:

 

TRACE [GossipTasks:1] 2011-12-02 15:12:51,829 FailureDetector.java (line 149) PHI for /192.168.68.228 : 38.154333610365036

INFO [GossipTasks:1] 2011-12-02 15:12:51,829 Gossiper.java (line 229) InetAddress /192.168.68.228 is now dead.

 

...

 

DEBUG [NonPeriodicTasks:1] 2011-12-02 15:12:51,845 ColumnFamilyStore.java (line 819) forceFlush requested but everything is clean

INFO [ScheduledTasks:1] 2011-12-02 15:12:51,829 StatusLogger.java (line 66) ReadRepairStage                   0         0         0

TRACE [GossipTasks:1] 2011-12-02 15:12:51,829 FailureDetector.java (line 149) PHI for /192.168.68.227 : -0.0

DEBUG [NonPeriodicTasks:1] 2011-12-02 15:12:51,845 ColumnFamilyStore.java (line 819) forceFlush requested but everything is clean

TRACE [GossipStage:1] 2011-12-02 15:12:51,845 FailureDetector.java (line 128) reporting /192.168.68.229

DEBUG [NonPeriodicTasks:1] 2011-12-02 15:12:51,845 ColumnFamilyStore.java (line 819) forceFlush requested but everything is clean

TRACE [GossipTasks:1] 2011-12-02 15:12:51,845 FailureDetector.java (line 149) PHI for /192.168.68.224 : 0.019569070233147485

INFO [ScheduledTasks:1] 2011-12-02 15:12:51,845 StatusLogger.java (line 66) MutationStage                     0         0         0

TRACE [GossipTasks:1] 2011-12-02 15:12:51,845 FailureDetector.java (line 149) PHI for /192.168.68.226 : 37.966339304199074

DEBUG [NonPeriodicTasks:1] 2011-12-02 15:12:51,845 ColumnFamilyStore.java (line 819) forceFlush requested but everything is clean

TRACE [GossipStage:1] 2011-12-02 15:12:51,845 FailureDetector.java (line 128) reporting /192.168.68.228

DEBUG [NonPeriodicTasks:1] 2011-12-02 15:12:51,845 ColumnFamilyStore.java (line 819) forceFlush requested but everything is clean

INFO [GossipTasks:1] 2011-12-02 15:12:51,845 Gossiper.java (line 229) InetAddress /192.168.68.226 is now dead.

 

...

 

TRACE [GossipTasks:1] 2011-12-02 15:13:03,898 FailureDetector.java (line 149) PHI for /192.168.68.228 : 7.7043961801903045

TRACE [GossipTasks:1] 2011-12-02 15:13:03,898 FailureDetector.java (line 149) PHI for /192.168.68.223 : 7.585990557120916

TRACE [GossipTasks:1] 2011-12-02 15:13:03,899 FailureDetector.java (line 149) PHI for /192.168.68.227 : 7.922553972766636

TRACE [GossipTasks:1] 2011-12-02 15:13:03,899 FailureDetector.java (line 149) PHI for /192.168.68.224 : 7.798568512691048

TRACE [GossipTasks:1] 2011-12-02 15:13:03,899 FailureDetector.java (line 149) PHI for /192.168.68.226 : 7.8425064901177715

TRACE [GossipTasks:1] 2011-12-02 15:13:03,899 FailureDetector.java (line 149) PHI for /192.168.68.225 : 4.592224429445155

TRACE [GossipTasks:1] 2011-12-02 15:13:03,900 FailureDetector.java (line 149) PHI for /192.168.68.222 : 8.06856164053645

INFO [GossipTasks:1] 2011-12-02 15:13:03,900 Gossiper.java (line 229) InetAddress /192.168.68.222 is now dead.

DEBUG [GossipTasks:1] 2011-12-02 15:13:03,900 MessagingService.java (line 153) Resetting pool for /192.168.68.222

TRACE [GossipTasks:1] 2011-12-02 15:13:03,901 FailureDetector.java (line 149) PHI for /192.168.68.229 : 7.645354417332889

TRACE [GossipTasks:1] 2011-12-02 15:13:03,901 FailureDetector.java (line 149) PHI for /192.168.68.230 : 7.775610031554557

 

TRACE [GossipTasks:1] 2011-12-02 15:13:04,903 Gossiper.java (line 307) Gossip Digests are : /192.168.68.221:1322136327:682506 /192.168.68.223:1322116132:702923 /192.168.68.222:1322116089:702938 /192.168.68.228:1322116156:702981 /192.168.68.225:1322817130:31 /192.168.68.230:1322116110:702870 /192.168.68.226:1322116095:702557 /192.168.68.221:1322136327:682506 /192.168.68.224:1322116106:702922 /192.168.68.227:1322116098:702974 /192.168.68.229:1322116107:702950

TRACE [GossipTasks:1] 2011-12-02 15:13:04,903 Gossiper.java (line 360) Sending a GossipDigestSynMessage to /192.168.68.224 ...

TRACE [GossipTasks:1] 2011-12-02 15:13:04,903 Gossiper.java (line 360) Sending a GossipDigestSynMessage to /192.168.68.228 ...

TRACE [GossipTasks:1] 2011-12-02 15:13:04,903 Gossiper.java (line 101) Performing status check ...

TRACE [GossipTasks:1] 2011-12-02 15:13:04,904 FailureDetector.java (line 149) PHI for /192.168.68.228 : 8.350335221549706

TRACE [GossipTasks:1] 2011-12-02 15:13:04,904 FailureDetector.java (line 149) PHI for /192.168.68.223 : 8.222055442973863

INFO [GossipTasks:1] 2011-12-02 15:13:04,904 Gossiper.java (line 229) InetAddress /192.168.68.223 is now dead.

 

The picture is the same on the other nodes.
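To compare nodes and confirm the weekly timing, the PHI samples and "is now dead" events can be pulled out of the logs with a small script. The following is only a minimal sketch, assuming the log line layout shown in the snippets above. The function names (`parse`, `dead_by_weekday_hour`, `peers_over_threshold`) are hypothetical helpers, not part of Cassandra, and the threshold of 8 is an assumption that merely appears consistent with the lines above where a node is marked dead right after its PHI passes 8.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed layout, taken from the snippets above:
# LEVEL [Thread] YYYY-MM-DD HH:MM:SS,mmm File.java (line N) message
LINE_RE = re.compile(
    r"^\s*(?P<level>[A-Z]+)\s+\[(?P<thread>[^\]]+)\]\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<src>\S+) \(line \d+\)\s+(?P<msg>.*)$"
)
PHI_RE = re.compile(r"^PHI for /(?P<ip>[\d.]+) : (?P<phi>\S+)$")
DEAD_RE = re.compile(r"^InetAddress /(?P<ip>[\d.]+) is now dead\.$")

WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")

# Assumption: conviction threshold of 8; adjust to the cluster's setting.
PHI_THRESHOLD = 8.0


def parse(lines):
    """Return (phi_samples, dead_events) extracted from log lines.

    phi_samples: list of (timestamp, ip, phi)
    dead_events: list of (timestamp, ip)
    Lines that do not match the assumed layout are skipped.
    """
    phi_samples, dead_events = [], []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
        msg = m.group("msg").strip()
        phi = PHI_RE.match(msg)
        if phi:
            try:
                value = float(phi.group("phi"))
            except ValueError:
                continue  # unparseable PHI value, ignore the line
            phi_samples.append((ts, phi.group("ip"), value))
            continue
        dead = DEAD_RE.match(msg)
        if dead:
            dead_events.append((ts, dead.group("ip")))
    return phi_samples, dead_events


def dead_by_weekday_hour(dead_events):
    """Count 'is now dead' events per (weekday name, hour of day)."""
    return Counter((WEEKDAYS[ts.weekday()], ts.hour) for ts, _ in dead_events)


def peers_over_threshold(phi_samples, threshold=PHI_THRESHOLD):
    """Return the sorted set of peers whose PHI exceeded the threshold."""
    return sorted({ip for _, ip, phi in phi_samples if phi > threshold})


if __name__ == "__main__":
    import sys

    # Usage: python phi_report.py system.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        samples, dead = parse(fh)
    print("peers over threshold:", peers_over_threshold(samples))
    for (day, hour), count in sorted(dead_by_weekday_hour(dead).items()):
        print(f"{day} {hour:02d}:00  dead events: {count}")
```

Running this over the logs of every node and comparing the weekday and hour buckets would show whether the events really cluster on Friday afternoons on all nodes, or only on some of them.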

 

Cassandra version: 0.7.8.

OS: Windows Server 2008 R2.

Cluster size 10 nodes.

Replication factor 5.

 

Best regards,

Konstantin Chernyakov.