Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 0A202200BC0 for ; Tue, 15 Nov 2016 20:23:30 +0100 (CET) Received: by cust-asf.ponee.io (Postfix) id 08B56160B03; Tue, 15 Nov 2016 19:23:30 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id D0DBC160AF2 for ; Tue, 15 Nov 2016 20:23:28 +0100 (CET) Received: (qmail 95897 invoked by uid 500); 15 Nov 2016 19:23:27 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 95887 invoked by uid 99); 15 Nov 2016 19:23:27 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd3-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 15 Nov 2016 19:23:27 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id E72C7185F1F for ; Tue, 15 Nov 2016 19:23:26 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 2.399 X-Spam-Level: ** X-Spam-Status: No, score=2.399 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=2, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H2=-0.001, RCVD_IN_SORBS_SPAM=0.5, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=disabled Authentication-Results: spamd3-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024) with ESMTP id EFYZAg4p2src for ; Tue, 15 Nov 2016 19:23:24 +0000 (UTC) Received: from mail-vk0-f43.google.com (mail-vk0-f43.google.com [209.85.213.43]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTPS id 11AFF5F472 for ; Tue, 15 Nov 2016 19:23:24 +0000 (UTC) Received: by mail-vk0-f43.google.com with SMTP id x186so97106863vkd.1 for ; Tue, 15 Nov 2016 11:23:24 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to; bh=T3qoOvxY+IyWFlBX7flkAoVbU5kOnsKDQGBB8QqUArk=; b=Qpir/+oMfIsLMOR+s2lcED6gdZddt23LrqMZfPOAqV5XywgcGAxVGxgdHD/w0qnwo5 efdTpDlx3T5QkQKwprBMaSKMAtJjnoVo0/Oc9W6i13ovvgwYzgi9l5wJPV89GtDWU8am Wu57V3dtyo+z+ag2XDIKG16WK98KNyVlnWBWckBAFnDQEvUS4/DZNsKe3fjUq3V8rXAh Dk7Jb63aJMINLzTBp3gdHKTyjLIVGCUuL6ZXE7WQydCD/gPgfZVdNt8+xBKAdbtg//KK Serk1C8C0Enn4Fxyetg26TruXxwQNe26dol7Dxah/rhe8czlsSYrfDtyi1kafnSITo5C YOGg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to; bh=T3qoOvxY+IyWFlBX7flkAoVbU5kOnsKDQGBB8QqUArk=; b=ShMuj5sBOE7UttsJkNgAu+5XsZuKSw5AFt2CsRpI7XE4xktV5O5OI+BW0DM3ROoXQ+ xzD0hQrheGuEJq4GppjAoayCzSWDUpSJ24kD8v2sSJqAYPqqz45Rc+JdjXE/qeaE+JIa w3kAvaeQtAb6eDal1urwTq0wwWAqgwxzJzyCPpIESky5er995jkx/2eMDTvfnvZJiEnv TNlneIE9xb1C1HxaBC4cbipor/zgZdmFOvsG4o9+qRwIas/RwmnuLzDQ+MzwmQx+qCO/ KD8R4+O1IGOKmOiHWVDP52x07rPeP5mhxQ7GjA1CIomEEL1SuI30hSRzhus9Yc5zDn8v An/Q== X-Gm-Message-State: ABUngvfSX4SJSW2suAWqyTUnD5f7/99rNtj++w8rq1H9EKy0BJNjLA9nvGW8ZTiTU8Gk3OLHAeky09E8Fnj60A== X-Received: by 10.31.98.71 with SMTP id w68mr14533328vkb.20.1479237800227; Tue, 15 Nov 2016 11:23:20 -0800 (PST) MIME-Version: 1.0 Received: by 10.176.82.219 with HTTP; Tue, 15 Nov 2016 11:23:19 -0800 (PST) In-Reply-To: References: From: Bhuvan Rawal Date: Wed, 16 Nov 2016 00:53:19 +0530 Message-ID: Subject: Re: High system CPU during high write workload To: user@cassandra.apache.org Content-Type: multipart/alternative; boundary=94eb2c094e6a9ef76305415be3b2 archived-at: Tue, 15 Nov 2016 19:23:30 -0000 --94eb2c094e6a9ef76305415be3b2 Content-Type: text/plain; charset=UTF-8 Hi Ben, Thanks for your reply, we tested the same workload on kernel version 4.6.4-1.el7.elrepo.x86_64 and found the issue to be not present there. This had resulted in really high CPU in write workloads -> area in which cassandra excels. Degrading performance by atleast 5x! I suggest this mention could be included in cassandra community wiki as it could impact a large audience. Thanks & Regards, Bhuvan On Tue, Nov 15, 2016 at 12:33 PM, Ben Bromhead wrote: > Hi Abhishek > > The article with the futex bug description lists the solution, which is to > upgrade to a version of RHEL or CentOS that have the specified patch. > > What help do you specifically need? If you need help upgrading the OS I > would look at the documentation for RHEL or CentOS. > > Ben > > On Mon, 14 Nov 2016 at 22:48 Abhishek Gupta > wrote: > > Hi, > > We are seeing an issue where the system CPU is shooting off to a figure or > > 90% when the cluster is subjected to a relatively high write workload i.e > 4k wreq/secs. > > 2016-11-14T13:27:47.900+0530 Process summary > process cpu=695.61% > application cpu=676.11% (*user=200.63% sys=475.49%) **<== Very High > System CPU * > other: cpu=19.49% > heap allocation rate *403mb*/s > [000533] user= 1.43% sys= 6.91% alloc= 2216kb/s - SharedPool-Worker-129 > [000274] user= 0.38% sys= 7.78% alloc= 2415kb/s - SharedPool-Worker-34 > [000292] user= 1.24% sys= 6.77% alloc= 2196kb/s - SharedPool-Worker-56 > [000487] user= 1.24% sys= 6.69% alloc= 2260kb/s - SharedPool-Worker-79 > [000488] user= 1.24% sys= 6.56% alloc= 2064kb/s - SharedPool-Worker-78 > [000258] user= 1.05% sys= 6.66% alloc= 2250kb/s - SharedPool-Worker-41 > > On doing strace it was found that the following system call is consuming > all the system CPU > timeout 10s strace -f -p 5954 -c -q > % time seconds usecs/call calls errors syscall > ------ ----------- ----------- --------- --------- ---------------- > > *88.33 1712.798399 16674 102723 22191 futex* 3.98 > 77.098730 4356 17700 read > 3.27 63.474795 394253 161 29 restart_syscall > 3.23 62.601530 29768 2103 epoll_wait > > On searching we found the following bug with the RHEL 6.6, CentOS 6.6 > kernel seems to be a probable cause for the issue: > > https://docs.datastax.com/en/landing_page/doc/landing_page/ > troubleshooting/cassandra/fetuxWaitBug.html > > The patch fix mentioned in the doc is also not present in our kernel. > > sudo rpm -q --changelog kernel-`uname -r` | grep futex | grep ref > - [kernel] futex_lock_pi() key refcnt fix (Danny Feng) [566347] > {CVE-2010-0623} > > Can some who has faced and resolved this issue help us here. > > Thanks, > Abhishek > > > -- > Ben Bromhead > CTO | Instaclustr > +1 650 284 9692 > Managed Cassandra / Spark on AWS, Azure and Softlayer > --94eb2c094e6a9ef76305415be3b2 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
Hi Ben,

Thanks for your reply, we teste= d the same workload on kernel version=C2=A04.6.4-1.el7.elrepo.x86_64 and fo= und the issue to be not present there.

This had re= sulted in really high CPU in write workloads -> area in which cassandra = excels. Degrading performance by atleast 5x! I suggest this mention could b= e included in cassandra community wiki as it could impact a large audience.=

Thanks & Regards,
Bhuvan

On Tue, Nov 15, = 2016 at 12:33 PM, Ben Bromhead <ben@instaclustr.com> wrote= :
Hi Abhishek

The article with the futex bug description lists the so= lution, which is to upgrade to a version of RHEL or CentOS that have the sp= ecified patch.=C2=A0
What help do you spe= cifically need? If you need help upgrading the OS I would look at the docum= entation for RHEL or CentOS.

Ben

On Mon, 14 Nov 2016 at 22:48 Abhishek Gup= ta <gupta.abhishek@snapdeal.com> = wrote:
Hi,

We are seeing an issue where the system CPU is shooting off to a fig= ure or > 90% when the cluster is subjected to a relatively high write wo= rkload i.e 4k wreq/secs.

2016-11-14T13:27:47.900+0530 Process summary=C2=A0
=C2=A0 process cpu= =3D695.61%
=C2=A0 application cpu=3D676.11% (user=3D200.63%=C2=A0sys=3D475.49%)=C2=A0<=3D=3D Very Hi= gh System CPU=C2=A0
=C2=A0 other: cpu=3D19.49%=C2=A0
=C2=A0 heap allocation rate=C2=A0403mb/s
[000533]= user=3D 1.43% sys=3D 6.91% alloc=3D 2216kb/s - SharedPool-Worker-129
[000274] user=3D 0.38% sys=3D 7.78% al= loc=3D 2415kb/s - SharedPool-Worker-34
[000292] user=3D 1.24% sys=3D 6.77% alloc= =3D 2196kb/s - SharedPool-Worker-56
[000487] user=3D 1.24% sys=3D 6.69% alloc=3D= 2260kb/s - SharedPool-Worker-79
[000488] user=3D 1.24% sys=3D 6.56% alloc=3D 20= 64kb/s - SharedPool-Worker-78
[000258] user=3D 1.05% sys=3D 6.66% alloc=3D 2250k= b/s - SharedPool-Worker-41

<= /font>
On doing strace= =C2=A0it was found that the following system call is consuming all the syst= em CPU
=C2=A0timeout 10s strace -f -p 5954 -c -q
% time =C2=A0 =C2=A0 seconds =C2=A0usecs/call =C2=A0 =C2=A0 calls = =C2=A0 =C2=A0errors syscall
------ = ----------- ----------- --------- --------- ----------------
88.3= 3 1712.798399 =C2=A0 =C2=A0 =C2=A0 16674 =C2=A0 =C2=A0102723 =C2=A0 =C2=A0 = 22191 futex
=C2=A03.98 =C2=A0 7= 7.098730 =C2=A0 =C2=A0 =C2=A0 =C2=A04356 =C2=A0 =C2=A0 17700 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 read
=C2=A03.2= 7 =C2=A0 63.474795 =C2=A0 =C2=A0 =C2=A0394253 =C2=A0 =C2=A0 =C2=A0 161 =C2= =A0 =C2=A0 =C2=A0 =C2=A029 restart_syscall
=C2=A03.23 =C2=A0 62.601530 =C2=A0 =C2=A0 =C2=A0 29768 =C2=A0 =C2= =A0 =C2=A02103 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 epoll_wait

On searching we found the=C2=A0following bu= g with the RHEL 6.6, CentOS 6.6 kernel seems to be a probable cause for the= issue:

https://docs.datastax.com/en/landing_page/doc/land= ing_page/troubleshooting/cassandra/fetuxWaitBug.html

The patch fix mentioned in t= he doc is also not present in our kernel.

sudo rpm -q --changelog kernel-`una= me -r` | grep futex | grep ref
- [k= ernel] futex_lock_pi() key refcnt fix (Danny Feng) [566347] {CVE-2010-0623}=

Can some who has faced and resolved this = issue help us here.

= Thanks,
Abhishek

=

--
Ben Bromhead
CTO |=C2=A0Instaclustr
+1 650 284 9692<= /div>
Managed Cassandra / Spark on AWS, Azure and Softlayer
=

--94eb2c094e6a9ef76305415be3b2--