Date: Mon, 13 Jul 2015 22:59:05 +0000 (UTC)
From: "Brandon Williams (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Created] (CASSANDRA-9793) Log when messages are dropped due to cross_node_timeout

Brandon Williams created CASSANDRA-9793:
-------------------------------------------

             Summary: Log when messages are dropped due to cross_node_timeout
                 Key: CASSANDRA-9793
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9793
             Project: Cassandra
          Issue Type: Improvement
          Components: Core
            Reporter: Brandon Williams
             Fix For: 2.1.x, 2.0.x

When a node has clock skew and cross node timeouts are enabled, there's no indication that the messages were dropped due to the cross node timeout, just that messages were dropped. This can errantly lead you down a path of troubleshooting a load shedding situation when really you just have clock drift on one node.

This is also not simple to troubleshoot, since you have to determine that this node will answer requests, but other nodes won't answer requests from it. If the problem goes away on a reboot (and the machine does a one-shot time sync, not a continuous one), it becomes even harder to detect, because you're left with a weird piece of evidence such as "it's fine after a reboot, but comes back in about X days every time."

It would help tremendously if there were a log message indicating how many messages (they don't need to be broken down by type) were eagerly dropped due to the cross node timeout.
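
To make the request concrete, here is a minimal sketch of the kind of bookkeeping that could produce such a log line. It is not Cassandra's implementation: the class `CrossNodeDropTracker`, its method names, and the summary wording are all hypothetical, and it assumes only that each message carries a timestamp from the sender's clock and that the receiver compares it against its own clock when cross node timeouts are enabled.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical helper (not part of Cassandra): decides whether a message
 * has timed out and remembers which clock the decision was based on.
 */
public class CrossNodeDropTracker {
    private final boolean crossNodeTimeout; // assumed to mirror the cross_node_timeout setting
    private final long timeoutMillis;       // timeout for the message type
    private final AtomicLong droppedInternal = new AtomicLong();
    private final AtomicLong droppedCrossNode = new AtomicLong();

    public CrossNodeDropTracker(boolean crossNodeTimeout, long timeoutMillis) {
        this.crossNodeTimeout = crossNodeTimeout;
        this.timeoutMillis = timeoutMillis;
    }

    /**
     * @param sentAtMillis     timestamp written by the sender's clock
     * @param receivedAtMillis local clock when the message arrived
     * @param nowMillis        local clock when the message is about to be processed
     * @return true if the message should be dropped
     */
    public boolean shouldDrop(long sentAtMillis, long receivedAtMillis, long nowMillis) {
        // Time spent waiting on this node only: unaffected by clock skew.
        if (nowMillis - receivedAtMillis > timeoutMillis) {
            droppedInternal.incrementAndGet();
            return true;
        }
        // Time since the sender stamped the message: mixes two clocks, so a
        // receiver whose clock runs ahead will see every message as "late".
        if (crossNodeTimeout && nowMillis - sentAtMillis > timeoutMillis) {
            droppedCrossNode.incrementAndGet();
            return true;
        }
        return false;
    }

    /** Returns a summary for the periodic log and resets the counters; null if nothing was dropped. */
    public String drainSummary() {
        long internal = droppedInternal.getAndSet(0);
        long cross = droppedCrossNode.getAndSet(0);
        if (internal == 0 && cross == 0) {
            return null;
        }
        return String.format(
            "%d messages dropped in last interval: %d for internal timeout and %d for cross node timeout",
            internal + cross, internal, cross);
    }
}
```

With a 2000 ms timeout and a receiver clock running 5 seconds ahead of the sender, `shouldDrop(1000, 6000, 6001)` returns true and is counted as a cross node drop, even though the message waited only 1 ms locally. A summary where cross node drops dominate would point at clock drift rather than load shedding.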