Return-Path: X-Original-To: apmail-cassandra-user-archive@www.apache.org Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 23E50791E for ; Mon, 5 Dec 2011 13:09:32 +0000 (UTC) Received: (qmail 63399 invoked by uid 500); 5 Dec 2011 13:09:29 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 63354 invoked by uid 500); 5 Dec 2011 13:09:28 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 63346 invoked by uid 99); 5 Dec 2011 13:09:28 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 05 Dec 2011 13:09:28 +0000 X-ASF-Spam-Status: No, hits=1.6 required=5.0 tests=FREEMAIL_ENVFROM_END_DIGIT,FREEMAIL_FROM,HTML_MESSAGE,NORMAL_HTTP_TO_IP,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of kossoft00@gmail.com designates 209.85.215.172 as permitted sender) Received: from [209.85.215.172] (HELO mail-ey0-f172.google.com) (209.85.215.172) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 05 Dec 2011 13:09:20 +0000 Received: by eaak10 with SMTP id k10so1171245eaa.31 for ; Mon, 05 Dec 2011 05:09:00 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:date:message-id:subject:from:to:content-type; bh=+Mn/1RWsd93qBwD97dxUehoLrjEeicytIzPbFXgK7Uk=; b=N+vH9BMdhg7SJZWKkKRQjnHBnUMQTe/IQNaUhXKTYW0py8SEBoHLeluuV0o4PB363L G4KjHcXSrTVu64J42VfmuBg1PEUyYnvwiDA9QX97KJU76UdwwEoH1nGwS7Kl0zOnqgVk A5qZrZK57xfZtnL1NV7/4mY2thEtwANRZ3TUQ= MIME-Version: 1.0 Received: by 10.213.25.218 with SMTP id a26mr846126ebc.118.1323090539943; Mon, 05 Dec 2011 05:08:59 -0800 (PST) Received: by 10.14.27.197 with HTTP; Mon, 5 Dec 2011 05:08:59 -0800 (PST) Date: Mon, 5 Dec 2011 19:08:59 +0600 Message-ID: Subject: Strange problem with cassandra From: Konstantin Chernyakov To: user@cassandra.apache.org Content-Type: multipart/alternative; boundary=0015174bdc7ea3988604b35805cb X-Virus-Checked: Checked by ClamAV on apache.org --0015174bdc7ea3988604b35805cb Content-Type: text/plain; charset=ISO-8859-1 Hi. We are faced with strange problem where Cassandra nodes lose each other only one day of week, on friday, in exactly 14:50 PM, within several months. On that time each node periodically reports that other nodes are dead. At same time nodes are working fine. This continues about one hour, after that cluster stabilizes. Low CPU load. There are several snippets of log file from one node: TRACE [GossipTasks:1] 2011-12-02 15:12:51,829 FailureDetector.java (line 149) PHI for /192.168.68.228 : 38.154333610365036 INFO [GossipTasks:1] 2011-12-02 15:12:51,829 Gossiper.java (line 229) InetAddress /192.168.68.228 is now dead. ... DEBUG [NonPeriodicTasks:1] 2011-12-02 15:12:51,845 ColumnFamilyStore.java (line 819) forceFlush requested but everything is clean INFO [ScheduledTasks:1] 2011-12-02 15:12:51,829 StatusLogger.java (line 66) ReadRepairStage 0 0 0 TRACE [GossipTasks:1] 2011-12-02 15:12:51,829 FailureDetector.java (line 149) PHI for /192.168.68.227 : -0.0 DEBUG [NonPeriodicTasks:1] 2011-12-02 15:12:51,845 ColumnFamilyStore.java (line 819) forceFlush requested but everything is clean TRACE [GossipStage:1] 2011-12-02 15:12:51,845 FailureDetector.java (line 128) reporting /192.168.68.229 DEBUG [NonPeriodicTasks:1] 2011-12-02 15:12:51,845 ColumnFamilyStore.java (line 819) forceFlush requested but everything is clean TRACE [GossipTasks:1] 2011-12-02 15:12:51,845 FailureDetector.java (line 149) PHI for /192.168.68.224 : 0.019569070233147485 INFO [ScheduledTasks:1] 2011-12-02 15:12:51,845 StatusLogger.java (line 66) MutationStage 0 0 0 TRACE [GossipTasks:1] 2011-12-02 15:12:51,845 FailureDetector.java (line 149) PHI for /192.168.68.226 : 37.966339304199074 DEBUG [NonPeriodicTasks:1] 2011-12-02 15:12:51,845 ColumnFamilyStore.java (line 819) forceFlush requested but everything is clean TRACE [GossipStage:1] 2011-12-02 15:12:51,845 FailureDetector.java (line 128) reporting /192.168.68.228 DEBUG [NonPeriodicTasks:1] 2011-12-02 15:12:51,845 ColumnFamilyStore.java (line 819) forceFlush requested but everything is clean INFO [GossipTasks:1] 2011-12-02 15:12:51,845 Gossiper.java (line 229) InetAddress /192.168.68.226 is now dead. ... TRACE [GossipTasks:1] 2011-12-02 15:13:03,898 FailureDetector.java (line 149) PHI for /192.168.68.228 : 7.7043961801903045 TRACE [GossipTasks:1] 2011-12-02 15:13:03,898 FailureDetector.java (line 149) PHI for /192.168.68.223 : 7.585990557120916 TRACE [GossipTasks:1] 2011-12-02 15:13:03,899 FailureDetector.java (line 149) PHI for /192.168.68.227 : 7.922553972766636 TRACE [GossipTasks:1] 2011-12-02 15:13:03,899 FailureDetector.java (line 149) PHI for /192.168.68.224 : 7.798568512691048 TRACE [GossipTasks:1] 2011-12-02 15:13:03,899 FailureDetector.java (line 149) PHI for /192.168.68.226 : 7.8425064901177715 TRACE [GossipTasks:1] 2011-12-02 15:13:03,899 FailureDetector.java (line 149) PHI for /192.168.68.225 : 4.592224429445155 TRACE [GossipTasks:1] 2011-12-02 15:13:03,900 FailureDetector.java (line 149) PHI for /192.168.68.222 : 8.06856164053645 INFO [GossipTasks:1] 2011-12-02 15:13:03,900 Gossiper.java (line 229) InetAddress /192.168.68.222 is now dead. DEBUG [GossipTasks:1] 2011-12-02 15:13:03,900 MessagingService.java (line 153) Resetting pool for /192.168.68.222 TRACE [GossipTasks:1] 2011-12-02 15:13:03,901 FailureDetector.java (line 149) PHI for /192.168.68.229 : 7.645354417332889 TRACE [GossipTasks:1] 2011-12-02 15:13:03,901 FailureDetector.java (line 149) PHI for /192.168.68.230 : 7.775610031554557 ... TRACE [GossipTasks:1] 2011-12-02 15:13:03,898 FailureDetector.java (line 149) PHI for /192.168.68.228 : 7.7043961801903045 TRACE [GossipTasks:1] 2011-12-02 15:13:03,898 FailureDetector.java (line 149) PHI for /192.168.68.223 : 7.585990557120916 TRACE [GossipTasks:1] 2011-12-02 15:13:03,899 FailureDetector.java (line 149) PHI for /192.168.68.227 : 7.922553972766636 TRACE [GossipTasks:1] 2011-12-02 15:13:03,899 FailureDetector.java (line 149) PHI for /192.168.68.224 : 7.798568512691048 TRACE [GossipTasks:1] 2011-12-02 15:13:03,899 FailureDetector.java (line 149) PHI for /192.168.68.226 : 7.8425064901177715 TRACE [GossipTasks:1] 2011-12-02 15:13:03,899 FailureDetector.java (line 149) PHI for /192.168.68.225 : 4.592224429445155 TRACE [GossipTasks:1] 2011-12-02 15:13:03,900 FailureDetector.java (line 149) PHI for /192.168.68.222 : 8.06856164053645 INFO [GossipTasks:1] 2011-12-02 15:13:03,900 Gossiper.java (line 229) InetAddress /192.168.68.222 is now dead. DEBUG [GossipTasks:1] 2011-12-02 15:13:03,900 MessagingService.java (line 153) Resetting pool for /192.168.68.222 TRACE [GossipTasks:1] 2011-12-02 15:13:03,901 FailureDetector.java (line 149) PHI for /192.168.68.229 : 7.645354417332889 TRACE [GossipTasks:1] 2011-12-02 15:13:03,901 FailureDetector.java (line 149) PHI for /192.168.68.230 : 7.775610031554557 TRACE [GossipTasks:1] 2011-12-02 15:13:04,903 Gossiper.java (line 307) Gossip Digests are : /192.168.68.221:1322136327:682506 /192.168.68.223:1322116132:702923 /192.168.68.222:1322116089:702938 /192.168.68.228:1322116156:702981 /192.168.68.225:1322817130:31 /192.168.68.230:1322116110:702870 /192.168.68.226:1322116095:702557 /192.168.68.221:1322136327:682506 /192.168.68.224:1322116106:702922 /192.168.68.227:1322116098:702974 /192.168.68.229:1322116107:702950 TRACE [GossipTasks:1] 2011-12-02 15:13:04,903 Gossiper.java (line 360) Sending a GossipDigestSynMessage to /192.168.68.224 ... TRACE [GossipTasks:1] 2011-12-02 15:13:04,903 Gossiper.java (line 360) Sending a GossipDigestSynMessage to /192.168.68.228 ... TRACE [GossipTasks:1] 2011-12-02 15:13:04,903 Gossiper.java (line 101) Performing status check ... TRACE [GossipTasks:1] 2011-12-02 15:13:04,904 FailureDetector.java (line 149) PHI for /192.168.68.228 : 8.350335221549706 TRACE [GossipTasks:1] 2011-12-02 15:13:04,904 FailureDetector.java (line 149) PHI for /192.168.68.223 : 8.222055442973863 INFO [GossipTasks:1] 2011-12-02 15:13:04,904 Gossiper.java (line 229) InetAddress /192.168.68.223 is now dead. The same picture on other nodes. Cassandra version 7.8. OS Windows server 2008R2. Cluster size 10 nodes. Replication factor 5. Best regards, Konstantin Chernyakov. --0015174bdc7ea3988604b35805cb Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Hi.

We are faced with strange problem where Cassandra nodes lose each other only one d= ay of week, on friday, in exactly 14:50 PM, within several months.

On that time each node periodically reports that other nodes are dead.

At same time nodes are working fine.

This continues about one hour, after that cluster stabilizes.

Low CPU load.

=A0

There are several snippets of log file from one node:

=A0

TRACE [GossipTasks:1] 2011-12-02 15:12:51,829 FailureDetector.java (line 149) PHI= for /192.168.68.228 : 38.154333610365036<= /span>

INFO [GossipTasks:1] 2011-12-02 15:12:51,829 Gossiper.java (line 229) InetAddres= s /192.168.68.228 is now dead.

=A0

...

=A0

DEBUG [NonPeriodicTasks:1] 2011-12-02 15:12:51,845 ColumnFamilyStore.java (line 8= 19) forceFlush requested but everything is clean

INFO [ScheduledTasks:1] 2011-12-02 15:12:51,829 StatusLogger.java (line 66) ReadRepairStage=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0=A0 0

TRACE [GossipTasks:1] 2011-12-02 15:12:51,829 FailureDetector.java (line 149) PHI= for /192.168.68.227 : -0.0

DEBUG [NonPeriodicTasks:1] 2011-12-02 15:12:51,845 ColumnFamilyStore.java (line 8= 19) forceFlush requested but everything is clean

TRACE [GossipStage:1] 2011-12-02 15:12:51,845 FailureDetector.java (line 128) reporting /192.168.68.229

DEBUG [NonPeriodicTasks:1] 2011-12-02 15:12:51,845 ColumnFamilyStore.java (line 8= 19) forceFlush requested but everything is clean

TRACE [GossipTasks:1] 2011-12-02 15:12:51,845 FailureDetector.java (line 149) PHI= for /192.168.68.224 : 0.01956907023314748= 5

INFO [ScheduledTasks:1] 2011-12-02 15:12:51,845 StatusLogger.java (line 66) MutationStage=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0=A0 0=A0=A0=A0=A0=A0=A0=A0=A0 0

TRACE [GossipTasks:1] 2011-12-02 15:12:51,845 FailureDetector.java (line 149) PHI= for /192.168.68.226 : 37.966339304199074<= /span>

DEBUG [NonPeriodicTasks:1] 2011-12-02 15:12:51,845 ColumnFamilyStore.java (line 8= 19) forceFlush requested but everything is clean

TRACE [GossipStage:1] 2011-12-02 15:12:51,845 FailureDetector.java (line 128) reporting /192.168.68.228

DEBUG [NonPeriodicTasks:1] 2011-12-02 15:12:51,845 ColumnFamilyStore.java (line 8= 19) forceFlush requested but everything is clean

INFO [GossipTasks:1] 2011-12-02 15:12:51,845 Gossiper.java (line 229) InetAddres= s /192.168.68.226 is now dead.

=A0

...

=A0

TRACE [GossipTasks:1] 2011-12-02 15:13:03,898 FailureDetector.java (line 149) PHI= for /192.168.68.228 : 7.7043961801903045<= /span>

TRACE [GossipTasks:1] 2011-12-02 15:13:03,898 FailureDetector.java (line 149) PHI= for /192.168.68.223 : 7.585990557120916

TRACE [GossipTasks:1] 2011-12-02 15:13:03,899 FailureDetector.java (line 149) PHI= for /192.168.68.227 : 7.922553972766636

TRACE [GossipTasks:1] 2011-12-02 15:13:03,899 FailureDetector.java (line 149) PHI= for /192.168.68.224 : 7.798568512691048

TRACE [GossipTasks:1] 2011-12-02 15:13:03,899 FailureDetector.java (line 149) PHI= for /192.168.68.226 : 7.8425064901177715<= /span>

TRACE [GossipTasks:1] 2011-12-02 15:13:03,899 FailureDetector.java (line 149) PHI= for /192.168.68.225 : 4.592224429445155

TRACE [GossipTasks:1] 2011-12-02 15:13:03,900 FailureDetector.java (line 149) PHI= for /192.168.68.222 : 8.06856164053645

INFO [GossipTasks:1] 2011-12-02 15:13:03,900 Gossiper.java (line 229) InetAddres= s /192.168.68.222 is now dead.

DEBUG [GossipTasks:1] 2011-12-02 15:13:03,900 MessagingService.java (line 153) Resetting pool for /192.168.68.222

TRACE [GossipTasks:1] 2011-12-02 15:13:03,901 FailureDetector.java (line 149) PHI= for /192.168.68.229 : 7.645354417332889

TRACE [GossipTasks:1] 2011-12-02 15:13:03,901 FailureDetector.java (line 149) PHI= for /192.168.68.230 : 7.775610031554557

=A0

...

=A0

TRACE [GossipTasks:1] 2011-12-02 15:13:03,898 FailureDetector.java (line 149) PHI= for /192.168.68.228 : 7.7043961801903045<= /span>

TRACE [GossipTasks:1] 2011-12-02 15:13:03,898 FailureDetector.java (line 149) PHI= for /192.168.68.223 : 7.585990557120916

TRACE [GossipTasks:1] 2011-12-02 15:13:03,899 FailureDetector.java (line 149) PHI for /192.168.68.227 : 7.922553972766636

TRACE [GossipTasks:1] 2011-12-02 15:13:03,899 FailureDetector.java (line 149) PHI= for /192.168.68.224 : 7.798568512691048

TRACE [GossipTasks:1] 2011-12-02 15:13:03,899 FailureDetector.java (line 149) PHI= for /192.168.68.226 : 7.8425064901177715<= /span>

TRACE [GossipTasks:1] 2011-12-02 15:13:03,899 FailureDetector.java (line 149) PHI= for /192.168.68.225 : 4.592224429445155

TRACE [GossipTasks:1] 2011-12-02 15:13:03,900 FailureDetector.java (line 149) PHI= for /192.168.68.222 : 8.06856164053645

INFO [GossipTasks:1] 2011-12-02 15:13:03,900 Gossiper.java (line 229) InetAddres= s /192.168.68.222 is now dead.

DEBUG [GossipTasks:1] 2011-12-02 15:13:03,900 MessagingService.java (line 153) Resetting pool for /192.168.68.222

TRACE [GossipTasks:1] 2011-12-02 15:13:03,901 FailureDetector.java (line 149) PHI= for /192.168.68.229 : 7.645354417332889

TRACE [GossipTasks:1] 2011-12-02 15:13:03,901 FailureDetector.java (line 149) PHI= for /192.168.68.230 : 7.775610031554557

TRACE [GossipTasks:1] 2011-12-02 15:13:04,903 Gossiper.java (line 307) Gossip Dig= ests are : /192.168.68.221:1322136327:682506 /192.168.68.223:1322116132:702923 /192.168.68.222:1322116089:702938 /192.168.68.228:1322116156:702981 /192.168.68.225:1322817130:31 /192.168.68.230:1322116110:702870 /192.168.68.226:1322116095:702557 /192.168.68.221:1322136327:682506 /192.168.68.224:1322116106:702922 /192.168.68.227:1322116098:702974 /192.168.68.229:1322116107:702950

TRACE [GossipTasks:1] 2011-12-02 15:13:04,903 Gossiper.java (line 360) Sending a GossipDigestSynMessage to /192.168.68.224= ...

TRACE [GossipTasks:1] 2011-12-02 15:13:04,903 Gossiper.java (line 360) Sending a GossipDigestSynMessage to /192.168.68.228= ...

TRACE [GossipTasks:1] 2011-12-02 15:13:04,903 Gossiper.java (line 101) Performing status check ..= .

TRACE [GossipTasks:1] 2011-12-02 15:13:04,904 FailureDetector.java (line 149) PHI= for /192.168.68.228 : 8.350335221549706

TRACE [GossipTasks:1] 2011-12-02 15:13:04,904 FailureDetector.java (line 149) PHI= for /192.168.68.223 : 8.222055442973863

INFO [GossipTasks:1] 2011-12-02 15:13:04,904 Gossiper.java (line 229) InetAddres= s /192.168.68.223 is now dead.

=A0

The same picture on other nodes.

=A0

Cassandra version 7.8.

OS Windows server 2008R2.

Cluster size 10 nodes.

Replication factor 5.

=A0

Best regards,

Konstantin Chernyakov.


--0015174bdc7ea3988604b35805cb--