Date: Fri, 30 Jun 2017 14:04:00 +0000 (UTC)
From: "Corentin Chary (JIRA)"
To: commits@cassandra.apache.org
Subject: [jira] [Updated] (CASSANDRA-13651) Large amount of CPU used by epoll_wait(.., .., .., 0)

     [ https://issues.apache.org/jira/browse/CASSANDRA-13651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Corentin Chary updated CASSANDRA-13651:
---------------------------------------
    Description:

I was trying to profile Cassandra under my workload and I kept seeing this backtrace:

{code}
epollEventLoopGroup-2-3 State: RUNNABLE
CPU usage on sample: 240ms
io.netty.channel.epoll.Native.epollWait0(int, long, int, int) Native.java (native)
io.netty.channel.epoll.Native.epollWait(int, EpollEventArray, int) Native.java:111
io.netty.channel.epoll.EpollEventLoop.epollWait(boolean) EpollEventLoop.java:230
io.netty.channel.epoll.EpollEventLoop.run() EpollEventLoop.java:254
io.netty.util.concurrent.SingleThreadEventExecutor$5.run() SingleThreadEventExecutor.java:858
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run() DefaultThreadFactory.java:138
java.lang.Thread.run() Thread.java:745
{code}

At first I thought that the profiler might not be able to profile native code properly, but I went further and realized that most of the CPU was used by {{epoll_wait()}} calls with a timeout of zero.

Here is the output of perf on this system, which confirms that most of the overhead was with timeout == 0:

{code}
Samples: 11M of event 'syscalls:sys_enter_epoll_wait', Event count (approx.): 11594448
Overhead  Trace output
  90.06%  epfd: 0x00000047, events: 0x7f5588c0c000, maxevents: 0x00002000, timeout: 0x00000000
   5.77%  epfd: 0x000000b5, events: 0x7fca419ef000, maxevents: 0x00001000, timeout: 0x00000000
   1.98%  epfd: 0x000000b5, events: 0x7fca419ef000, maxevents: 0x00001000, timeout: 0x000003e8
   0.04%  epfd: 0x00000003, events: 0x2f6af77b9c00, maxevents: 0x00000020, timeout: 0x00000000
   0.04%  epfd: 0x0000002b, events: 0x121ebf63ac00, maxevents: 0x00000040, timeout: 0x00000000
   0.03%  epfd: 0x00000026, events: 0x7f51f80019c0, maxevents: 0x00000020, timeout: 0x00000000
   0.02%  epfd: 0x00000003, events: 0x7fe4d80019d0, maxevents: 0x00000020, timeout: 0x00000000
{code}

Running this time with perf record -ag for call traces:

{code}
# Children      Self       sys       usr  Trace output
# ........  ........  ........  ........  ....................................................
#
     8.61%     8.61%     0.00%     8.61%  epfd: 0x000000a7, events: 0x7fca452d6000, maxevents: 0x00001000, timeout: 0x00000000
            |
            ---0x1000200af313
               |
                --8.61%--0x7fca6117bdac
                          0x7fca60459804
                          epoll_wait
     2.98%     2.98%     0.00%     2.98%  epfd: 0x000000a7, events: 0x7fca452d6000, maxevents: 0x00001000, timeout: 0x000003e8
            |
            ---0x1000200af313
               0x7fca6117b830
               0x7fca60459804
               epoll_wait
{code}

That looks like a lot of CPU used to wait for nothing. I'm not sure whether perf reports a per-CPU or a per-system percentage, but either way that is at least 10% of Cassandra's total CPU usage.

I went further and found the code responsible: we schedule a lot of {{Message::Flusher}} tasks with a deadline of 10 usec (5 per message, I think), but netty+epoll only supports millisecond timeouts and ends up converting these sub-millisecond delays to 0.
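As a quick sanity check of that claim (my own arithmation, standalone and not code from Cassandra or netty), the millisecond conversion used by the epoll event loop turns a 10 usec delay into a 0 ms timeout:

{code}
// Standalone illustration only -- not Cassandra or netty code.
// Shows how a 10 usec scheduling delay becomes a 0 ms epoll timeout.
public class TimeoutRoundingDemo {
    public static void main(String[] args) {
        long delayNanos = 10_000L; // the 10 usec deadline used by Message::Flusher
        // Same rounding netty applies before calling epollWait() (see the diff below):
        long timeoutMillis = (delayNanos + 500_000L) / 1_000_000L;
        System.out.println(timeoutMillis); // prints 0 -> epoll_wait(..., 0), a pure busy poll
    }
}
{code}

Any delay under 0.5 ms ends up as 0 with this formula.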
I added some traces to netty (4.1):

{code}
diff --git a/transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java b/transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java
index 909088fde..8734bbfd4 100644
--- a/transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java
+++ b/transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java
@@ -208,10 +208,15 @@ final class EpollEventLoop extends SingleThreadEventLoop {
         long currentTimeNanos = System.nanoTime();
         long selectDeadLineNanos = currentTimeNanos + delayNanos(currentTimeNanos);
         for (;;) {
-            long timeoutMillis = (selectDeadLineNanos - currentTimeNanos + 500000L) / 1000000L;
+            long timeoutNanos = selectDeadLineNanos - currentTimeNanos + 500000L;
+            long timeoutMillis = timeoutNanos / 1000000L;
+            System.out.printf("timeoutNanos: %d, timeoutMillis: %d | deadline: %d - now: %d | hastask: %d\n",
+                    timeoutNanos, timeoutMillis,
+                    selectDeadLineNanos, currentTimeNanos, hasTasks() ? 1 : 0);
             if (timeoutMillis <= 0) {
                 if (selectCnt == 0) {
                     int ready = Native.epollWait(epollFd.intValue(), events, 0);
+                    System.out.printf("ready: %d\n", ready);
                     if (ready > 0) {
                         return ready;
                     }
{code}

And this gives:

{code}
timeoutNanos: 1000500000, timeoutMillis: 1000 | deadline: 2001782341816510 - now: 2001781341816510 | hastask: 0
timeoutNanos: 1000500000, timeoutMillis: 1000 | deadline: 2001782342087239 - now: 2001781342087239 | hastask: 0
timeoutNanos: 1000500000, timeoutMillis: 1000 | deadline: 2001782342166947 - now: 2001781342166947 | hastask: 0
timeoutNanos: 508459, timeoutMillis: 0 | deadline: 2001781342297987 - now: 2001781342289528 | hastask: 0
ready: 0
timeoutNanos: 508475, timeoutMillis: 0 | deadline: 2001781342357719 - now: 2001781342349244 | hastask: 0
ready: 0
timeoutNanos: 509327, timeoutMillis: 0 | deadline: 2001781342394822 - now: 2001781342385495 | hastask: 0
ready: 0
timeoutNanos: 509339, timeoutMillis: 0 | deadline: 2001781342430192 - now: 2001781342420853 | hastask: 0
ready: 0
timeoutNanos: 509510, timeoutMillis: 0 | deadline: 2001781342461588 - now: 2001781342452078 | hastask: 0
ready: 0
timeoutNanos: 509493, timeoutMillis: 0 | deadline: 2001781342495044 - now: 2001781342485551 | hastask: 0
ready: 0
{code}

The nanosecond timeouts all come from {{eventLoop.schedule(this, 10000, TimeUnit.NANOSECONDS);}} in {{Message::Flusher}}.

Knowing that, I'm not sure what would be best to do. I have a hard time understanding Message::Flusher, but it looks to me like scheduling fewer tasks would probably help, and I didn't see anything obvious that could be done on the netty side.

Changing {{if (++runsWithNoWork > 5)}} to 2 seems to help a little bit, but that isn't really significant.
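To make the pattern concrete, here is a stripped-down sketch of how I read the flusher's rescheduling. Only the {{eventLoop.schedule(this, 10000, TimeUnit.NANOSECONDS)}} call and the {{runsWithNoWork > 5}} check are taken from the actual class; the queue and flush helper below are paraphrased stand-ins, not the real Cassandra code:

{code}
import io.netty.channel.EventLoop;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.TimeUnit;

// Rough sketch of the rescheduling pattern only -- NOT the actual
// org.apache.cassandra.transport Message::Flusher implementation.
final class FlusherSketch implements Runnable {
    private final EventLoop eventLoop;
    private final Queue<Runnable> pending = new ConcurrentLinkedQueue<>(); // stand-in for queued responses
    private int runsWithNoWork = 0;

    FlusherSketch(EventLoop eventLoop) {
        this.eventLoop = eventLoop;
    }

    @Override
    public void run() {
        boolean doneWork = false;
        Runnable flush;
        while ((flush = pending.poll()) != null) {
            flush.run();           // stand-in for writing/flushing a response
            doneWork = true;
        }

        if (doneWork) {
            runsWithNoWork = 0;
        } else if (++runsWithNoWork > 5) {
            return;                // give up until new work arrives
        }

        // Re-arm 10 usec from now; with epoll's millisecond resolution this
        // rounds to 0 and the event loop ends up calling epoll_wait(..., 0).
        eventLoop.schedule(this, 10000, TimeUnit.NANOSECONDS);
    }
}
{code}

Every run that finds no work still re-arms a sub-millisecond timer up to five more times, and each of those timers can only be expressed to epoll as a zero timeout.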
> Large amount of CPU used by epoll_wait(.., .., .., 0)
> -----------------------------------------------------
>
>                 Key: CASSANDRA-13651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13651
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Corentin Chary
>             Fix For: 4.x