Date: Mon, 24 Apr 2017 14:33:04 +0000 (UTC)
From: "Christian Esken (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Comment Edited] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread

    [ https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981236#comment-15981236 ]

Christian Esken edited comment on CASSANDRA-13265 at 4/24/17 2:32 PM:
----------------------------------------------------------------------

First here is a summary and the question I have:
The tests work if I add "DatabaseDescriptor.daemonInitialization();" to the unit test of the affected branches. Is this a good idea, [~aweisberg]?

Now the long story:

This is the status for branch cassandra-13265-3.0:
- (/) Running unit tests in Eclipse: Works
- (/)/(?) CircleCI: All normal tests work fine. "Your build ran 4754 tests in junit with 0 failures". The build fails for me with: Target "stress-test" does not exist in the project "apache-cassandra". As "ant test" worked, I would guess that the patch is fine.
  I will reverify the specific unit test locally.

This is the status for branch cassandra-13265-3.11 and cassandra-13265-trunk:
- (/) Running unit tests in Eclipse: Works
- (x) Running unit tests with CircleCI or "ant test" fails, due to non-initialized DatabaseDescriptor.

When I add the following to the unit test of cassandra-13265-3.11, the unit test works.
{code}
DatabaseDescriptor.daemonInitialization();
{code}

{code}
    [junit] Null Test:  Caused an ERROR
    [junit] null
    [junit] java.lang.ExceptionInInitializerError
    [junit]     at java.lang.Class.forName0(Native Method)
    [junit]     at java.lang.Class.forName(Class.java:264)
    [junit] Caused by: java.lang.NullPointerException
    [junit]     at org.apache.cassandra.config.DatabaseDescriptor.getWriteRpcTimeout(DatabaseDescriptor.java:1400)
    [junit]     at org.apache.cassandra.net.MessagingService$Verb$1.getTimeout(MessagingService.java:121)
    [junit]     at org.apache.cassandra.net.OutboundTcpConnectionTest.<clinit>(OutboundTcpConnectionTest.java:43)
{code}


was (Author: cesken):
First here is the summary:
The tests work if I add "DatabaseDescriptor.daemonInitialization();" to the unit test of the affected branches. Is this a good idea, [~aweisberg]?

Now the long story:

This is the status for branch cassandra-13265-3.0:
- (/) Running unit tests in Eclipse: Works
- (/)/(?) CircleCI: All normal tests work fine. "Your build ran 4754 tests in junit with 0 failures". The build fails for me with: Target "stress-test" does not exist in the project "apache-cassandra". As "ant test" worked, I would guess that the patch is fine. I will reverify the specific unit test locally.

This is the status for branch cassandra-13265-3.11 and cassandra-13265-trunk:
- (/) Running unit tests in Eclipse: Works
- (x) Running unit tests with CircleCI or "ant test" fails, due to non-initialized DatabaseDescriptor.

When I add the following to the unit test of cassandra-13265-3.11, the unit test works.
{code}
DatabaseDescriptor.daemonInitialization();
{code}

{code}
    [junit] Null Test:  Caused an ERROR
    [junit] null
    [junit] java.lang.ExceptionInInitializerError
    [junit]     at java.lang.Class.forName0(Native Method)
    [junit]     at java.lang.Class.forName(Class.java:264)
    [junit] Caused by: java.lang.NullPointerException
    [junit]     at org.apache.cassandra.config.DatabaseDescriptor.getWriteRpcTimeout(DatabaseDescriptor.java:1400)
    [junit]     at org.apache.cassandra.net.MessagingService$Verb$1.getTimeout(MessagingService.java:121)
    [junit]     at org.apache.cassandra.net.OutboundTcpConnectionTest.<clinit>(OutboundTcpConnectionTest.java:43)
{code}


> Expiration in OutboundTcpConnection can block the reader Thread
> ---------------------------------------------------------------
>
>                 Key: CASSANDRA-13265
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 1.8.0_112-b15)
> Linux 3.16
>            Reporter: Christian Esken
>            Assignee: Christian Esken
>             Fix For: 3.0.x
>
>         Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to communicate with the other nodes. This can happen at any time, during peak load or low load. Restarting that single node fixes the issue.
> Before going into details, I want to state that I have analyzed the situation and am already developing a possible fix. Here is the analysis so far:
> - A thread dump in this situation showed 324 Threads in the OutboundTcpConnection class that want to lock the backlog queue for doing expiration.
> - A class histogram shows 262508 instances of OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain number of queued messages, it starts thrashing itself to death.
> Each of the Threads fully locks the Queue for reading and writing by calling iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operations can it progress with actually writing to the Queue.
> - Reading: Is also blocked, as 324 Threads try to do iterator.next(), and fully lock the Queue.
> This means: Writing blocks the Queue for reading, and readers might even be starved, which makes the situation even worse.
> -----
> The setup is:
> - 3-node cluster
> - replication factor 2
> - Consistency LOCAL_ONE
> - No remote DCs
> - high write throughput (100000 INSERT statements per second and more during peak times).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
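
The stack trace in the comment above shows a NullPointerException raised while a static initializer of the test class reads a configuration value that has not been set up yet, which the JVM then reports as ExceptionInInitializerError. Below is a minimal, self-contained sketch of that failure mode in plain Java. It does not use Cassandra code: Config, EarlyUser and LateUser are hypothetical stand-ins invented for illustration, and Config.init() only plays the role that DatabaseDescriptor.daemonInitialization() plays in the comment.

```java
import java.util.function.LongSupplier;

public class StaticInitOrderSketch {

    /** Hypothetical stand-in for a configuration holder; not Cassandra's DatabaseDescriptor. */
    static final class Config {
        private static Long writeTimeoutMillis; // stays null until init() runs

        static void init() {
            writeTimeoutMillis = 2000L;
        }

        static long getWriteTimeoutMillis() {
            return writeTimeoutMillis; // unboxing null throws NullPointerException
        }
    }

    /** Reads the configuration in its static initializer, before init() was called. */
    static final class EarlyUser {
        static final long TIMEOUT_MILLIS = Config.getWriteTimeoutMillis();
    }

    /** Same static initializer, but first touched only after init() was called. */
    static final class LateUser {
        static final long TIMEOUT_MILLIS = Config.getWriteTimeoutMillis();
    }

    /** Runs the first access to a class and reports whether its initialization worked. */
    static String describe(LongSupplier firstAccess) {
        try {
            return "ok: " + firstAccess.getAsLong();
        } catch (ExceptionInInitializerError e) {
            return "failed: " + e.getCause().getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // Class initialization runs on first access, so the order of these lines matters.
        System.out.println(describe(() -> EarlyUser.TIMEOUT_MILLIS)); // failed: NullPointerException
        Config.init();
        System.out.println(describe(() -> LateUser.TIMEOUT_MILLIS));  // ok: 2000
    }
}
```

In this sketch the initialization has to happen before the class holding the static field is first touched. Where such a call belongs in the real OutboundTcpConnectionTest is exactly the question raised in the comment, and the sketch does not answer it.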