Date: Fri, 6 Jan 2017 23:54:58 +0000 (UTC)
From: "Danil Smirnov (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Updated] (CASSANDRA-13095) Timeouts between nodes

     [ https://issues.apache.org/jira/browse/CASSANDRA-13095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danil Smirnov updated CASSANDRA-13095:
--------------------------------------
    Attachment: 13095-2.1.patch

> Timeouts between nodes
> ----------------------
>
>                 Key: CASSANDRA-13095
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13095
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Danil Smirnov
>            Priority: Minor
>         Attachments: 13095-2.1.patch
>
>
> Recently I ran into a problem on a heavily loaded cluster where messages between certain nodes sometimes become blocked for no apparent reason.
> It looks like the same situation as the one described here: https://issues.apache.org/jira/browse/CASSANDRA-12676?focusedCommentId=15736166&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15736166
> A thread dump showed an infinite loop here: https://github.com/apache/cassandra/blob/a8a43dd32eb92406d7d8b105e08c68b3d5c7df49/src/java/org/apache/cassandra/utils/CoalescingStrategies.java#L109
> Apparently the problem is in the initial value of the epoch field in the TimeHorizonMovingAverageCoalescingStrategy class. When its value is not evenly divisible by BUCKET_INTERVAL, ix(epoch-1) does not point to the correct bucket. As a result, sum gradually increases and, upon reaching MEASURED_INTERVAL, averageGap becomes 0 and the thread blocks.
> It's hard to reproduce because it takes a long time for sum to grow, and when no messages are sent for some time, sum is reset to 0 (https://github.com/apache/cassandra/blob/a8a43dd32eb92406d7d8b105e08c68b3d5c7df49/src/java/org/apache/cassandra/utils/CoalescingStrategies.java#L301) and the bug is no longer reproducible (until the connection between the nodes is re-created).
> I've attached a patch which should fix the problem. I don't know if it will be of any help, since CASSANDRA-12676 will apparently disable this behaviour. One note about the performance regressions, though: there is a small chance that they are a result of the bug described here, so it might be worth testing performance after the fix and/or tuning the algorithm.
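The index arithmetic behind the quoted description can be illustrated with a small standalone sketch. It is not the Cassandra source and not the attached patch: the class name, the constant values (a 2^26 ns bucket, 16 buckets), the body of ix() and the alignDown() helper are all assumptions chosen for illustration. It only shows that, with a shift-and-mask bucket index, ix(epoch-1) lands on the bucket just before ix(epoch) when epoch is a multiple of BUCKET_INTERVAL, and on the same bucket as ix(epoch) otherwise.

```java
// Illustrative sketch only; names and constant values are assumptions,
// not copied from CoalescingStrategies.java.
final class EpochAlignmentSketch {
    static final int INDEX_SHIFT = 26;                       // assumed bucket width: 2^26 ns
    static final long BUCKET_INTERVAL = 1L << INDEX_SHIFT;
    static final int BUCKET_COUNT = 16;                      // assumed power-of-two ring size

    // Map a nanosecond timestamp to a slot in a ring of BUCKET_COUNT buckets.
    static int ix(long nanos) {
        return (int) ((nanos >>> INDEX_SHIFT) & (BUCKET_COUNT - 1));
    }

    // Hypothetical helper: round a timestamp down to a bucket boundary.
    static long alignDown(long nanos) {
        return nanos & ~(BUCKET_INTERVAL - 1);
    }

    // True when ix(epoch - 1) names a different bucket than ix(epoch).
    static boolean partialBucketIsDistinct(long epoch) {
        return ix(epoch - 1) != ix(epoch);
    }

    public static void main(String[] args) {
        long aligned = 5 * BUCKET_INTERVAL;   // exact multiple of BUCKET_INTERVAL
        long unaligned = aligned + 12_345;    // falls inside a bucket

        // Aligned epoch: epoch - 1 falls in the previous bucket (4 vs 5).
        System.out.println(ix(aligned - 1) + " vs " + ix(aligned));
        // Unaligned epoch: epoch - 1 is still in the same bucket (5 vs 5).
        System.out.println(ix(unaligned - 1) + " vs " + ix(unaligned));
        // Rounding down restores the boundary property.
        System.out.println(partialBucketIsDistinct(alignDown(unaligned)));
    }
}
```

Under these assumptions, any code that treats ix(epoch-1) as a bucket separate from the one at ix(epoch) only behaves as intended when epoch sits on a bucket boundary, which is consistent with the reporter's observation about the initial value of epoch.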