From: Haithem Jarraya
To: user@cassandra.apache.org
Date: Fri, 14 Jun 2013 13:01:28 +0100
Subject: Thrift threads proliferation

Hi All,

We are facing a very strange issue in our C* ring. We are running C* v1.2.4 with 7 nodes in DC1, 3 nodes in DC2 and 3 nodes in DC3.
We have been testing read/write performance in DC1 with different disk configurations.
For instance, node1-DC1 uses JBOD and node2-DC1 uses RAID-0.
Over the last week everything seemed to be running fine until yesterday, when node2-DC1 (RAID-0) stopped responding to client requests and started timing out queries.
The JMX console showed up to 25k Thrift threads running, no compactions pending and a lot of pending reads; nothing else stood out. CPU is averaging 10%, and heap usage is about 4GB of the 8GB available.
Node2-DC1 became unresponsive, yet the other nodes kept sending it queries, and Gossip never flagged it as dead or unresponsive. I'm wondering if this is a bug.
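
The Thrift thread count seen in the JMX console can also be tracked from a small program, which makes it easier to watch whether it keeps climbing. This is a minimal sketch, assuming the node exposes JMX on Cassandra's default port 7199 without authentication, and assuming the Thrift worker threads have names starting with "Thrift" (confirm the real names with jstack or the JMX console first). `ThreadCounter` and its methods are hypothetical names used only for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ThreadCounter {
    // Counts live threads whose name starts with the given prefix.
    static int countThreadsWithPrefix(ThreadMXBean threads, String prefix) {
        int count = 0;
        for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds())) {
            // Entries are null for threads that exited between the two calls.
            if (info != null && info.getThreadName().startsWith(prefix)) {
                count++;
            }
        }
        return count;
    }

    // Connects to a remote JVM's JMX agent (assumed: no authentication) and counts matching threads.
    static int countRemote(String host, int port, String prefix) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
            return countThreadsWithPrefix(threads, prefix);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: ThreadCounter <host> <jmx-port> [thread-name-prefix]");
            return;
        }
        // "Thrift" is an assumed default prefix, not confirmed from Cassandra's source.
        String prefix = args.length > 2 ? args[2] : "Thrift";
        int n = countRemote(args[0], Integer.parseInt(args[1]), prefix);
        System.out.println(prefix + "* threads: " + n);
    }
}
```

Running it a few minutes apart against the affected node (for example `java ThreadCounter <host> 7199`) shows whether the thread count is still growing.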

The log file shows the following after stopping Thrift on node2-DC1:

 INFO [ScheduledTasks:1] 2013-06-14 09:29:37,433 StatusLogger.java (line 53) Pool Name                    Active   Pending   Blocked
 INFO [ScheduledTasks:1] 2013-06-14 09:29:37,564 StatusLogger.java (line 68) ReadStage                        30       959         0
 INFO [ScheduledTasks:1] 2013-06-14 09:29:37,564 StatusLogger.java (line 68) RequestResponseStage              0         0         0
 INFO [ScheduledTasks:1] 2013-06-14 09:29:37,565 StatusLogger.java (line 68) ReadRepairStage                   0         0         0
 INFO [ScheduledTasks:1] 2013-06-14 09:29:37,565 StatusLogger.java (line 68) MutationStage                     0         0         0
 INFO [ScheduledTasks:1] 2013-06-14 09:29:37,566 StatusLogger.java (line 68) ReplicateOnWriteStage             0         0         0
 INFO [ScheduledTasks:1] 2013-06-14 09:29:37,566 StatusLogger.java (line 68) GossipStage                       0         0         0
 INFO [ScheduledTasks:1] 2013-06-14 09:29:37,567 StatusLogger.java (line 68) AntiEntropyStage                  0         0         0
 INFO [ScheduledTasks:1] 2013-06-14 09:29:37,567 StatusLogger.java (line 68) MigrationStage                    0         0         0
 INFO [ScheduledTasks:1] 2013-06-14 09:29:37,568 StatusLogger.java (line 68) MemtablePostFlusher               0         0         0
 INFO [ScheduledTasks:1] 2013-06-14 09:29:37,568 StatusLogger.java (line 68) FlushWriter                       0         0         0
 INFO [ScheduledTasks:1] 2013-06-14 09:29:37,569 StatusLogger.java (line 68) MiscStage                         0         0         0
 INFO [ScheduledTasks:1] 2013-06-14 09:29:37,569 StatusLogger.java (line 68) commitlog_archiver                0         0         0
 INFO [ScheduledTasks:1] 2013-06-14 09:29:37,570 StatusLogger.java (line 68) InternalResponseStage             0         0         0
 INFO [ScheduledTasks:1] 2013-06-14 09:29:37,570 StatusLogger.java (line 68) HintedHandoff                     0         0         0
 INFO [ScheduledTasks:1] 2013-06-14 09:29:37,572 StatusLogger.java (line 73) CompactionManager                 0         0
 INFO [ScheduledTasks:1] 2013-06-14 09:29:37,573 StatusLogger.java (line 85) MessagingService                n/a      0,42
 INFO [ScheduledTasks:1] 2013-06-14 09:29:37,574 StatusLogger.java (line 95) Cache Type                     Size                 Capacity               KeysToSave    Provider
 INFO [ScheduledTasks:1] 2013-06-14 09:29:37,574 StatusLogger.java (line 96) KeyCache                  602369792               1048576000                      all
 INFO [ScheduledTasks:1] 2013-06-14 09:29:37,574 StatusLogger.java (line 102) RowCache                          0                        0                      all
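
In a long StatusLogger dump like the one above, the stages that are backing up can be pulled out mechanically. This is a minimal sketch, assuming each thread-pool row ends in three numeric columns (Active, Pending, Blocked) as in this log, and that the log file is UTF-8. `StatusLogScan` and `poolsWithBacklog` are hypothetical names used only for illustration.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StatusLogScan {
    // Matches "<Pool> <Active> <Pending> <Blocked>" after the StatusLogger prefix.
    // Rows without three numeric columns (CompactionManager, MessagingService, caches) are skipped.
    private static final Pattern POOL_ROW = Pattern.compile(
        "StatusLogger\\.java \\(line \\d+\\) (\\S+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s*$");

    // Returns one summary string per pool that has pending or blocked tasks.
    static List<String> poolsWithBacklog(List<String> lines) {
        List<String> backlog = new ArrayList<>();
        for (String line : lines) {
            Matcher m = POOL_ROW.matcher(line);
            if (!m.find()) {
                continue;
            }
            long pending = Long.parseLong(m.group(3));
            long blocked = Long.parseLong(m.group(4));
            if (pending > 0 || blocked > 0) {
                backlog.add(m.group(1) + " active=" + m.group(2)
                    + " pending=" + pending + " blocked=" + blocked);
            }
        }
        return backlog;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.out.println("usage: StatusLogScan <system.log>");
            return;
        }
        // Files.readAllLines(Path) decodes the file as UTF-8.
        for (String row : poolsWithBacklog(Files.readAllLines(Paths.get(args[0])))) {
            System.out.println(row);
        }
    }
}
```

Applied to the dump above, only ReadStage (30 active, 959 pending, 0 blocked) would be reported; every other pool shows zero pending and zero blocked tasks.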


Any hints for tracking down this problem would be useful.

Many Thanks,

Haithem