From: Sandeep Nethi
To: user@cassandra.apache.org
Date: Wed, 1 May 2019 22:47:50 +1200
Subject: Re: cassandra node was put down with oom error

Are you by any chance running a full repair on these nodes?

Thanks,
Sandeep

On Wed, 1 May 2019 at 10:46 PM, Mia <yeomii999@gmail.com> wrote:
>
> Hello, Ayub.
>
> I'm using Apache Cassandra, not the DSE edition, so I have never used the
> DSE Search feature.
> In my case, all the nodes of the cluster have the same problem.
>
> Thanks.
>
> On 2019/05/01 06:13:06, Ayub M wrote:
> > Do you have search running on the same nodes, or is it only Cassandra?
> > In my case it was due to a memory leak bug in DSE Search that consumed
> > more memory, resulting in the OOM.
> >
> > On Tue, Apr 30, 2019, 2:58 AM yeomii999@gmail.com wrote:
> >
> > > Hello,
> > >
> > > I'm suffering from a similar problem with OSS Cassandra version 3.11.3.
> > > My Cassandra cluster has been running for more than a year, and there
> > > was no problem until this year.
> > > The cluster is write-intensive, consists of 70 nodes, and all rows have
> > > a 2-hour TTL.
> > > The only change was the read consistency, from QUORUM to ONE. (I cannot
> > > revert this change because of the read latency.)
> > > Below is my compaction strategy.
> > > ```
> > > compaction = {'class':
> > > 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy',
> > > 'compaction_window_size': '3', 'compaction_window_unit': 'MINUTES',
> > > 'enabled': 'true', 'max_threshold': '32', 'min_threshold': '4',
> > > 'tombstone_compaction_interval': '60', 'tombstone_threshold': '0.2',
> > > 'unchecked_tombstone_compaction': 'false'}
> > > ```
> > > I've tried rolling-restarting the cluster several times,
> > > but the memory usage of the Cassandra process always keeps growing.
> > > I also tried Native Memory Tracking, but it reports less memory usage
> > > than the system measures (RSS in /proc/{cassandra-pid}/status).
> > >
> > > Is there any way I could figure out the cause of this problem?
> > >
> > > On 2019/01/26 20:53:26, Jeff Jirsa wrote:
> > > > You're running DSE, so the OSS list may not be much help. Datastax may
> > > > have more insight.
> > > >
> > > > In open source, the only things off-heap that vary significantly are
> > > > bloom filters and compression offsets - both scale with disk space,
> > > > and both increase during compaction. Large STCS compactions can cause
> > > > pretty meaningful allocations for these. Also, if you have an
> > > > unusually low compression chunk size or a very low bloom filter FP
> > > > ratio, those will be larger.
> > > >
> > > > --
> > > > Jeff Jirsa
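For reference, those off-heap components are reported per table by nodetool, which makes it easy to see which one is actually growing. A rough sketch, assuming a 3.x node; the keyspace and table names are placeholders:

```
# Per-table off-heap breakdown: bloom filter, index summary, compression
# metadata and memtable off-heap usage (keyspace/table are placeholders).
nodetool tablestats my_keyspace.my_table | grep -i 'off heap'

# Node-wide off-heap total, to compare against the process RSS.
nodetool info | grep -i 'off heap'

# The two settings Jeff mentions; unusually low values make those
# structures larger.
cqlsh -e "DESCRIBE TABLE my_keyspace.my_table" | grep -iE 'bloom_filter_fp_chance|chunk_length'
```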
> > > >
> > > > > On Jan 26, 2019, at 12:11 PM, Ayub M wrote:
> > > > >
> > > > > Cassandra node went down due to OOM, and checking /var/log/messages I see the below.
> > > > >
> > > > > ```
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: java invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: java cpuset=/ mems_allowed=0
> > > > > ....
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 DMA32: 1294*4kB (UM) 932*8kB (UEM) 897*16kB (UEM) 483*32kB (UEM) 224*64kB (UEM) 114*128kB (UEM) 41*256kB (UEM) 12*512kB (UEM) 7*1024kB (UEM) 2*2048kB (EM) 35*4096kB (UM) = 242632kB
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 Normal: 5319*4kB (UE) 3233*8kB (UEM) 960*16kB (UE) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 62500kB
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 38109 total pagecache pages
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 0 pages in swap cache
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Swap cache stats: add 0, delete 0, find 0/0
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Free swap  = 0kB
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Total swap = 0kB
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 16394647 pages RAM
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 0 pages HighMem/MovableOnly
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 310559 pages reserved
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 2634]     0  2634    41614      326      82        0             0 systemd-journal
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 2690]     0  2690    29793      541      27        0             0 lvmetad
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 2710]     0  2710    11892      762      25        0         -1000 systemd-udevd
> > > > > .....
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [13774]     0 13774   459778    97729     429        0             0 Scan Factory
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14506]     0 14506    21628     5340      24        0             0 macompatsvc
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14586]     0 14586    21628     5340      24        0             0 macompatsvc
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14588]     0 14588    21628     5340      24        0             0 macompatsvc
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14589]     0 14589    21628     5340      24        0             0 macompatsvc
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14598]     0 14598    21628     5340      24        0             0 macompatsvc
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14599]     0 14599    21628     5340      24        0             0 macompatsvc
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14600]     0 14600    21628     5340      24        0             0 macompatsvc
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14601]     0 14601    21628     5340      24        0             0 macompatsvc
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [19679]     0 19679    21628     5340      24        0             0 macompatsvc
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [19680]     0 19680    21628     5340      24        0             0 macompatsvc
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 9084]  1007  9084  2822449   260291     810        0             0 java
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 8509]  1007  8509 17223585 14908485   32510        0             0 java
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [21877]     0 21877   461828    97716     318        0             0 ScanAction Mgr
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [21884]     0 21884   496653    98605     340        0             0 OAS Manager
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [31718]    89 31718    25474      486      48        0             0 pickup
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 4891]  1007  4891    26999      191       9        0             0 iostat
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 4957]  1007  4957    26999      192      10        0             0 iostat
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Out of memory: Kill process 8509 (java) score 928 or sacrifice child
> > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Killed process 8509 (java) total-vm:68894340kB, anon-rss:59496344kB, file-rss:137596kB, shmem-rss:0kB
> > > > > ```
> > > > >
> > > > > Nothing else runs on this host except DSE Cassandra with search and monitoring agents. The max heap size is set to 31 GB, and the Cassandra java process seems to be using ~57 GB (RAM is 62 GB) at the time of the error.
> > > > > So I am guessing the JVM started using a lot of memory and that triggered the OOM error.
> > > > > Is my understanding correct? That this is a Linux-triggered kill of the JVM because the JVM was consuming more than the available memory?
> > > > >
> > > > > So in this case the JVM was using a max of 31 GB of heap, and the remaining 26 GB it was using is non-heap memory. Normally this process takes around 42 GB, and the fact that it was consuming 57 GB at the moment of the OOM makes me suspect the java process is the culprit rather than the victim.
> > > > >
> > > > > At the time of the issue no heap dump was taken; I have configured it now. But even if a heap dump had been taken, would it have helped figure out what is consuming more memory? A heap dump only covers the heap area, so what should be used to dump the non-heap memory? Native Memory Tracking is one thing I came across.
> > > > > Is there any way to have native memory dumped when the OOM occurs?
> > > > > What's the best way to monitor the JVM memory to diagnose OOM errors?
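Two notes on the last questions. The standard -XX:+HeapDumpOnOutOfMemoryError flag only fires on a Java-level OutOfMemoryError, not when the kernel OOM killer sends SIGKILL, so it would not have captured this event. For the native side, Native Memory Tracking plus jcmd is a reasonable starting point; a minimal sketch, assuming a recent HotSpot JVM, with <cassandra-pid> as a placeholder:

```
# 1. Enable Native Memory Tracking (roughly 5-10% overhead) by adding this
#    flag to conf/jvm.options (or cassandra-env.sh) and restarting the node:
#      -XX:NativeMemoryTracking=summary

# 2. Take a baseline after startup, then diff later to see which category
#    (Java Heap, Thread, Internal, Code, ...) is growing:
jcmd <cassandra-pid> VM.native_memory baseline
jcmd <cassandra-pid> VM.native_memory summary.diff

# 3. Compare against what the kernel charges the process. NMT only counts
#    memory the JVM itself allocates, so mmapped SSTables and allocations
#    made by native libraries will not show up - one reason NMT < RSS.
grep VmRSS /proc/<cassandra-pid>/status
pmap -x <cassandra-pid> | tail -1
```

For ongoing monitoring, logging VmRSS and the NMT summary side by side every minute or so is usually enough to show whether heap, thread stacks, direct buffers, or something outside the JVM's own accounting grows before the kernel steps in.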
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
For additional commands, e-mail: user-help@cassandra.apache.org