Date: Wed, 2 Aug 2017 05:33:01 +0000 (UTC)
From: "Norman Maurer (JIRA)"
To: commits@cassandra.apache.org
Subject: [jira] [Commented] (CASSANDRA-13651) Large amount of CPU used by epoll_wait(.., .., .., 0)

    [ https://issues.apache.org/jira/browse/CASSANDRA-13651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16110335#comment-16110335 ]

Norman Maurer commented on CASSANDRA-13651:
-------------------------------------------

Sorry for the late response, I didn't see the mention :( So yes, netty uses `epoll_wait`, which only supports millisecond resolution, so everything smaller than that will just cause `epoll_wait(...)` to be called with a timeout of `0`, i.e. a non-blocking check of the ready fds.

What we could do in our native transport implementation is make use of `timerfd` [1] to schedule timeouts, but again this would only work when using the native epoll transport, not when you use the nio transport (which works on all OSes). So I think what you really want to do is have timeouts of >= 1ms.
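For illustration, a minimal sketch of such a clamp (class and method names are hypothetical, this is not Cassandra's actual Message::Flusher code): round every sub-millisecond delay up to 1 ms before handing it to the event loop, so `epoll_wait` gets a non-zero timeout and can actually block:

{code}
import io.netty.channel.EventLoop;
import java.util.concurrent.TimeUnit;

final class FlushScheduler {
    // epoll_wait only has millisecond resolution; anything below 1 ms
    // degenerates into a timeout-0 busy poll on the event loop.
    private static final long MIN_DELAY_NANOS = TimeUnit.MILLISECONDS.toNanos(1);

    static void scheduleFlush(EventLoop loop, Runnable flusher, long delayNanos) {
        long clamped = Math.max(delayNanos, MIN_DELAY_NANOS);
        loop.schedule(flusher, clamped, TimeUnit.NANOSECONDS);
    }
}
{code}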
Comments and ideas welcome :)

> Large amount of CPU used by epoll_wait(.., .., .., 0)
> -----------------------------------------------------
>
>                 Key: CASSANDRA-13651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13651
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Corentin Chary
>             Fix For: 4.x
>
>         Attachments: cpu-usage.png
>
>
> I was trying to profile Cassandra under my workload and I kept seeing this backtrace:
> {code}
> epollEventLoopGroup-2-3 State: RUNNABLE CPU usage on sample: 240ms
> io.netty.channel.epoll.Native.epollWait0(int, long, int, int) Native.java (native)
> io.netty.channel.epoll.Native.epollWait(int, EpollEventArray, int) Native.java:111
> io.netty.channel.epoll.EpollEventLoop.epollWait(boolean) EpollEventLoop.java:230
> io.netty.channel.epoll.EpollEventLoop.run() EpollEventLoop.java:254
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run() SingleThreadEventExecutor.java:858
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run() DefaultThreadFactory.java:138
> java.lang.Thread.run() Thread.java:745
> {code}
> At first I thought that the profiler might not be able to profile native code properly, but I went further and realized that most of the CPU was used by {{epoll_wait()}} calls with a timeout of zero.
> Here is the output of perf on this system, which confirms that most of the overhead was with timeout == 0.
> {code}
> Samples: 11M of event 'syscalls:sys_enter_epoll_wait', Event count (approx.): 11594448
> Overhead  Trace output
>   90.06%  epfd: 0x00000047, events: 0x7f5588c0c000, maxevents: 0x00002000, timeout: 0x00000000
>    5.77%  epfd: 0x000000b5, events: 0x7fca419ef000, maxevents: 0x00001000, timeout: 0x00000000
>    1.98%  epfd: 0x000000b5, events: 0x7fca419ef000, maxevents: 0x00001000, timeout: 0x000003e8
>    0.04%  epfd: 0x00000003, events: 0x2f6af77b9c00, maxevents: 0x00000020, timeout: 0x00000000
>    0.04%  epfd: 0x0000002b, events: 0x121ebf63ac00, maxevents: 0x00000040, timeout: 0x00000000
>    0.03%  epfd: 0x00000026, events: 0x7f51f80019c0, maxevents: 0x00000020, timeout: 0x00000000
>    0.02%  epfd: 0x00000003, events: 0x7fe4d80019d0, maxevents: 0x00000020, timeout: 0x00000000
> {code}
> Running this time with perf record -ag for call traces:
> {code}
> # Children      Self       sys       usr  Trace output
> # ........  ........  ........  ........  ....................................................
> #
>      8.61%     8.61%     0.00%     8.61%  epfd: 0x000000a7, events: 0x7fca452d6000, maxevents: 0x00001000, timeout: 0x00000000
>             |
>             ---0x1000200af313
>                |
>                 --8.61%--0x7fca6117bdac
>                          0x7fca60459804
>                          epoll_wait
>      2.98%     2.98%     0.00%     2.98%  epfd: 0x000000a7, events: 0x7fca452d6000, maxevents: 0x00001000, timeout: 0x000003e8
>             |
>             ---0x1000200af313
>                0x7fca6117b830
>                0x7fca60459804
>                epoll_wait
> {code}
> That looks like a lot of CPU used to wait for nothing. I'm not sure whether perf reports a per-CPU or a per-system percentage, but either way that would still be at least 10% of the total CPU usage of Cassandra.
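> As a hypothetical standalone reproducer (my own sketch, not Cassandra code; it assumes netty-transport-native-epoll on the classpath and Linux), rescheduling a no-op task every 10 usec on an epoll event loop shows the same timeout-zero spinning:
> {code}
> import io.netty.channel.EventLoop;
> import io.netty.channel.epoll.EpollEventLoopGroup;
> import java.util.concurrent.TimeUnit;
>
> public final class BusyEpollRepro {
>     public static void main(String[] args) {
>         final EventLoop loop = new EpollEventLoopGroup(1).next();
>         Runnable tick = new Runnable() {
>             @Override
>             public void run() {
>                 // sub-millisecond deadline, like the 10 usec flush schedule
>                 loop.schedule(this, 10_000, TimeUnit.NANOSECONDS);
>             }
>         };
>         loop.schedule(tick, 10_000, TimeUnit.NANOSECONDS);
>         // attach perf to this JVM: nearly all epoll_wait calls have timeout == 0
>     }
> }
> {code}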
> I went further and found the code responsible: we schedule a lot of {{Message::Flusher}} tasks with a deadline of 10 usec (5 per message, I think), but netty+epoll only supports timeouts of a millisecond or more and will convert everything below that to 0.
> I added some traces to netty (4.1):
> {code}
> diff --git a/transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java b/transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java
> index 909088fde..8734bbfd4 100644
> --- a/transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java
> +++ b/transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java
> @@ -208,10 +208,15 @@ final class EpollEventLoop extends SingleThreadEventLoop {
>          long currentTimeNanos = System.nanoTime();
>          long selectDeadLineNanos = currentTimeNanos + delayNanos(currentTimeNanos);
>          for (;;) {
> -            long timeoutMillis = (selectDeadLineNanos - currentTimeNanos + 500000L) / 1000000L;
> +            long timeoutNanos = selectDeadLineNanos - currentTimeNanos + 500000L;
> +            long timeoutMillis = timeoutNanos / 1000000L;
> +            System.out.printf("timeoutNanos: %d, timeoutMillis: %d | deadline: %d - now: %d | hastask: %d\n",
> +                    timeoutNanos, timeoutMillis,
> +                    selectDeadLineNanos, currentTimeNanos, hasTasks() ? 1 : 0);
>              if (timeoutMillis <= 0) {
>                  if (selectCnt == 0) {
>                      int ready = Native.epollWait(epollFd.intValue(), events, 0);
> +                    System.out.printf("ready: %d\n", ready);
>                      if (ready > 0) {
>                          return ready;
>                      }
> {code}
> And this gives:
> {code}
> timeoutNanos: 1000500000, timeoutMillis: 1000 | deadline: 2001782341816510 - now: 2001781341816510 | hastask: 0
> timeoutNanos: 1000500000, timeoutMillis: 1000 | deadline: 2001782342087239 - now: 2001781342087239 | hastask: 0
> timeoutNanos: 1000500000, timeoutMillis: 1000 | deadline: 2001782342166947 - now: 2001781342166947 | hastask: 0
> timeoutNanos: 508459, timeoutMillis: 0 | deadline: 2001781342297987 - now: 2001781342289528 | hastask: 0
> ready: 0
> timeoutNanos: 508475, timeoutMillis: 0 | deadline: 2001781342357719 - now: 2001781342349244 | hastask: 0
> ready: 0
> timeoutNanos: 509327, timeoutMillis: 0 | deadline: 2001781342394822 - now: 2001781342385495 | hastask: 0
> ready: 0
> timeoutNanos: 509339, timeoutMillis: 0 | deadline: 2001781342430192 - now: 2001781342420853 | hastask: 0
> ready: 0
> timeoutNanos: 509510, timeoutMillis: 0 | deadline: 2001781342461588 - now: 2001781342452078 | hastask: 0
> ready: 0
> timeoutNanos: 509493, timeoutMillis: 0 | deadline: 2001781342495044 - now: 2001781342485551 | hastask: 0
> ready: 0
> {code}
> The nanosecond timeouts all come from {{eventLoop.schedule(this, 10000, TimeUnit.NANOSECONDS);}} in {{Message::Flusher}}.
> Knowing that, I'm not sure what would be best to do. I have a hard time understanding Message::Flusher, but it looks to me like scheduling fewer tasks would probably help, and I couldn't think of anything obvious that could be done on the netty side.
> Changing {{if (++runsWithNoWork > 5)}} to 2 seems to help a little bit, but that isn't really significant.
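> A rough sketch of the "schedule fewer tasks" idea (names hypothetical, not the current Flusher code): coalesce flush requests through an atomic flag so that at most one flush task is in flight per event loop, instead of several per message:
> {code}
> import io.netty.channel.EventLoop;
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.atomic.AtomicBoolean;
>
> final class CoalescingFlusher implements Runnable {
>     private final EventLoop loop;
>     private final AtomicBoolean scheduled = new AtomicBoolean();
>
>     CoalescingFlusher(EventLoop loop) {
>         this.loop = loop;
>     }
>
>     // called once per message; only the first caller actually schedules a task
>     void requestFlush() {
>         if (scheduled.compareAndSet(false, true)) {
>             loop.schedule(this, 1, TimeUnit.MILLISECONDS); // >= 1 ms, see above
>         }
>     }
>
>     @Override
>     public void run() {
>         scheduled.set(false);
>         // drain pending messages and flush the channels here
>     }
> }
> {code}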