From: Alain RODRIGUEZ
Date: Thu, 20 Jun 2019 12:25:37 +0200
Subject: Re: node re-start delays , busy Deleting mc-txn-compaction/ Adding log file replica
To: "user cassandra.apache.org"

Hello Asad,

> I’m on environment with apache Cassandra 3.11.1 with java 1.8.0_144.
> One Node went OOM and crashed.

If I remember correctly, the first minor versions of C* 3.11 had memory leaks. It seems this one was fixed in your version, though:

> 3.11.1
> [...]
>  * BTree.Builder memory leak (CASSANDRA-13754)

Yet other improvements were made later on:

> 3.11.3
> [...]
>  * Remove BTree.Builder Recycler to reduce memory usage (CASSANDRA-13929)
>  * Reduce nodetool GC thread count (CASSANDRA-14475)

See: https://github.com/apache/cassandra/blob/cassandra-3.11/CHANGES.txt.

Before digging further, I would upgrade to the latest 3.11 release (3.11.4 or 3.11.5, I guess), because early versions of a major Cassandra release are known for being quite broken, even though this major is a 'bug fix only' branch.
Minor version upgrades are also not too risky to go through. I would start there if you're not sure how to investigate this.

If it happens again, or if you don't want to upgrade, it would be interesting to know:
- Whether the OOM happens inside the JVM or in native memory (in which case the OS would be the one sending the kill signal). These two issues have different (and sometimes opposite) fixes.
- What the host size is (especially memory) and how the heap (and maybe some off-heap structures) are configured (at least whatever is not default).
- Whether you saw errors in the logs, and what 'nodetool tpstats' looked like when the node went down (it might have been dumped in the logs).

I don't know much about those traces, nor why Cassandra would take a long time to start. They are TRACE-level messages, which are harder for me to interpret. What do the INFO / WARN / ERROR messages look like?
Maybe the node is opening a lot of SSTables and/or replaying a lot of commit logs, given the nature of the restart (post outage)?
To speed things up under normal circumstances, when nodes are not crashing, run 'nodetool drain' as part of stopping the node, before stopping or killing the service/process.

C*heers,
-----------------------
Alain Rodriguez - alain@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

On Tue, 18 Jun 2019 at 23:43, ZAIDI, ASAD A wrote:

> I’m on environment with apache Cassandra 3.11.1 with java 1.8.0_144.
>
> One Node went OOM and crashed. Re-starting this crashed node is taking long time. Trace level debug log is showing messages like:
>
> Debug.log trace excerpt:
> ========================
>
> TRACE [main] 2019-06-18 21:30:43,449 LogTransaction.java:217 - Deleting /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc-9337720-big-CompressionInfo.db
> TRACE [main] 2019-06-18 21:30:43,449 LogTransaction.java:217 - Deleting /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc-9337720-big-Filter.db
> TRACE [main] 2019-06-18 21:30:43,449 LogTransaction.java:217 - Deleting /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc-9337720-big-TOC.txt
> TRACE [main] 2019-06-18 21:30:43,455 LogTransaction.java:217 - Deleting /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc_txn_compaction_642976c0-91c3-11e9-97bb-6b1dee397c3f.log
> TRACE [main] 2019-06-18 21:30:43,458 LogReplicaSet.java:67 - Added log file replica /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc_txn_compaction_5a6c8c90-91cc-11e9-97bb-6b1dee397c3f.log
>
> Above messages are repeated for unique [mc-nnnn-* ] files. Such messages are repeating constantly.
>
> I’m seeking help here to find out what may be going on here , any hint to root cause and how I can quickly start the node. Thanks in advance.
>
> Regards/asad
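The first question above (did the OOM happen inside the JVM, or did the OS kill the process?) can often be answered by looking for different markers in different logs. The Python sketch below is only an illustration under stated assumptions: it assumes a JVM-level OOM leaves a `java.lang.OutOfMemoryError` line in the Cassandra logs, and that a Linux OOM kill leaves an "Out of memory: Kill process ..." or "Out of memory: Killed process ..." line in the kernel log. The function name `classify_oom` and the sample log lines are made up for the example, and the exact kernel wording may differ between kernel versions.

```python
import re

# Marker the JVM writes when an allocation fails (heap, metaspace, ...).
JVM_OOM = re.compile(r"java\.lang\.OutOfMemoryError")
# Marker the Linux OOM killer typically writes to the kernel log (assumed wording).
KERNEL_OOM = re.compile(r"Out of memory: Kill(?:ed)? process \d+ \((\S+?)\)")


def classify_oom(cassandra_log: str, kernel_log: str) -> str:
    """Return 'jvm', 'os', 'both' or 'unknown' based on simple text markers."""
    jvm = bool(JVM_OOM.search(cassandra_log))
    # Only count kernel kills whose victim process is named "java".
    os_kill = any(m.group(1) == "java" for m in KERNEL_OOM.finditer(kernel_log))
    if jvm and os_kill:
        return "both"
    if jvm:
        return "jvm"
    if os_kill:
        return "os"
    return "unknown"


if __name__ == "__main__":
    # Illustrative inputs only; real input would be the contents of the
    # Cassandra log files and the kernel log (for example the output of dmesg).
    sample_cassandra = "java.lang.OutOfMemoryError: Java heap space"
    sample_kernel = "Out of memory: Kill process 4242 (java) score 955 or sacrifice child"
    print(classify_oom(sample_cassandra, ""))
    print(classify_oom("", sample_kernel))
```

A 'jvm' result points towards heap sizing or a leak inside the JVM, while an 'os' result points towards total process memory (heap plus off-heap) exceeding what the host can give, which is why the two cases call for different fixes.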
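To get a sense of how much work the repeated TRACE messages in the quoted excerpt represent, the lines can be tallied per SSTable generation and per transaction log. The Python sketch below does this for the five lines quoted in the thread. It is a rough illustration, not a Cassandra tool: the regular expressions are inferred from those five lines only, and the names `summarize` and `SAMPLE` are hypothetical.

```python
import re
from collections import Counter, defaultdict
from pathlib import PurePosixPath

# Patterns inferred from the excerpt quoted in this thread (an assumption).
DELETING = re.compile(r"LogTransaction\.java:\d+ - Deleting (\S+)")
ADDED = re.compile(r"LogReplicaSet\.java:\d+ - Added log file replica (\S+)")
COMPONENT = re.compile(r"^mc-(\d+)-big-(.+)$")

DATA_DIR = "/cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a"
SAMPLE = [
    f"TRACE [main] 2019-06-18 21:30:43,449 LogTransaction.java:217 - Deleting {DATA_DIR}/mc-9337720-big-CompressionInfo.db",
    f"TRACE [main] 2019-06-18 21:30:43,449 LogTransaction.java:217 - Deleting {DATA_DIR}/mc-9337720-big-Filter.db",
    f"TRACE [main] 2019-06-18 21:30:43,449 LogTransaction.java:217 - Deleting {DATA_DIR}/mc-9337720-big-TOC.txt",
    f"TRACE [main] 2019-06-18 21:30:43,455 LogTransaction.java:217 - Deleting {DATA_DIR}/mc_txn_compaction_642976c0-91c3-11e9-97bb-6b1dee397c3f.log",
    f"TRACE [main] 2019-06-18 21:30:43,458 LogReplicaSet.java:67 - Added log file replica {DATA_DIR}/mc_txn_compaction_5a6c8c90-91cc-11e9-97bb-6b1dee397c3f.log",
]


def summarize(lines):
    """Tally deleted SSTable components and transaction logs seen in TRACE lines."""
    deleted_by_generation = defaultdict(list)  # generation number -> component names
    counts = Counter()
    for line in lines:
        deleted = DELETING.search(line)
        if deleted:
            name = PurePosixPath(deleted.group(1)).name
            component = COMPONENT.match(name)
            if component:
                deleted_by_generation[int(component.group(1))].append(component.group(2))
                counts["sstable_components_deleted"] += 1
            elif name.startswith("mc_txn_"):
                counts["txn_logs_deleted"] += 1
            continue
        if ADDED.search(line):
            counts["txn_logs_opened"] += 1
    return dict(deleted_by_generation), dict(counts)


if __name__ == "__main__":
    generations, counts = summarize(SAMPLE)
    print(generations)
    print(counts)
```

On a real node, the same function could be fed the lines of debug.log (its location depends on the installation) to see how many distinct SSTable generations and transaction logs the startup is working through.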