cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Corentin Chary (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-13651) Large amount of CPU used by epoll_wait(.., .., .., 0)
Date Tue, 25 Jul 2017 12:36:00 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-13651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099971#comment-16099971
] 

Corentin Chary commented on CASSANDRA-13651:
--------------------------------------------

Latest patch (https://github.com/iksaif/cassandra/tree/cassandra-13651-trunk)  tested with
an actual workload, I attached the screenshot.
Looks like we can get ~2% of CPU back with -Dcassandra.netty_flush_delay_nanoseconds=0 and
some more with -Dio.netty.eventLoopThreads=6. This doesn't not seem to affect latency

!Screenshot (5).png|thumbnail!

> Large amount of CPU used by epoll_wait(.., .., .., 0)
> -----------------------------------------------------
>
>                 Key: CASSANDRA-13651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13651
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Corentin Chary
>             Fix For: 4.x
>
>
> I was trying to profile Cassandra under my workload and I kept seeing this backtrace:
> {code}
> epollEventLoopGroup-2-3 State: RUNNABLE CPU usage on sample: 240ms
> io.netty.channel.epoll.Native.epollWait0(int, long, int, int) Native.java (native)
> io.netty.channel.epoll.Native.epollWait(int, EpollEventArray, int) Native.java:111
> io.netty.channel.epoll.EpollEventLoop.epollWait(boolean) EpollEventLoop.java:230
> io.netty.channel.epoll.EpollEventLoop.run() EpollEventLoop.java:254
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run() SingleThreadEventExecutor.java:858
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run() DefaultThreadFactory.java:138
> java.lang.Thread.run() Thread.java:745
> {code}
> At fist I though that the profiler might not be able to profile native code properly,
but I wen't further and I realized that most of the CPU was used by {{epoll_wait()}} calls
with a timeout of zero.
> Here is the output of perf on this system, which confirms that most of the overhead was
with timeout == 0.
> {code}
> Samples: 11M of event 'syscalls:sys_enter_epoll_wait', Event count (approx.): 11594448
> Overhead  Trace output                                                              
                                                                                         
                                  ◆
>   90.06%  epfd: 0x00000047, events: 0x7f5588c0c000, maxevents: 0x00002000, timeout: 0x00000000
                                                                                         
                        ▒
>    5.77%  epfd: 0x000000b5, events: 0x7fca419ef000, maxevents: 0x00001000, timeout: 0x00000000
                                                                                         
                        ▒
>    1.98%  epfd: 0x000000b5, events: 0x7fca419ef000, maxevents: 0x00001000, timeout: 0x000003e8
                                                                                         
                        ▒
>    0.04%  epfd: 0x00000003, events: 0x2f6af77b9c00, maxevents: 0x00000020, timeout: 0x00000000
                                                                                         
                        ▒
>    0.04%  epfd: 0x0000002b, events: 0x121ebf63ac00, maxevents: 0x00000040, timeout: 0x00000000
                                                                                         
                        ▒
>    0.03%  epfd: 0x00000026, events: 0x7f51f80019c0, maxevents: 0x00000020, timeout: 0x00000000
                                                                                         
                        ▒
>    0.02%  epfd: 0x00000003, events: 0x7fe4d80019d0, maxevents: 0x00000020, timeout: 0x00000000
> {code}
> Running this time with perf record -ag for call traces:
> {code}
> # Children      Self       sys       usr  Trace output                              
                                         
> # ........  ........  ........  ........  ....................................................................................
> #
>      8.61%     8.61%     0.00%     8.61%  epfd: 0x000000a7, events: 0x7fca452d6000, maxevents:
0x00001000, timeout: 0x00000000
>             |
>             ---0x1000200af313
>                |          
>                 --8.61%--0x7fca6117bdac
>                           0x7fca60459804
>                           epoll_wait
>      2.98%     2.98%     0.00%     2.98%  epfd: 0x000000a7, events: 0x7fca452d6000, maxevents:
0x00001000, timeout: 0x000003e8
>             |
>             ---0x1000200af313
>                0x7fca6117b830
>                0x7fca60459804
>                epoll_wait
> {code}
> That looks like a lot of CPU used to wait for nothing. I'm not sure if pref reports a
per-CPU percentage or a per-system percentage, but that would be still be 10% of the total
CPU usage of Cassandra at the minimum.
> I went further and found the code of all that: We schedule a lot of {{Message::Flusher}}
with a deadline of 10 usec (5 per messages I think) but netty+epoll only support timeouts
above the milliseconds and will convert everything bellow to 0.
> I added some traces to netty (4.1):
> {code}
> diff --git a/transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java
b/transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java
> index 909088fde..8734bbfd4 100644
> --- a/transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java
> +++ b/transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java
> @@ -208,10 +208,15 @@ final class EpollEventLoop extends SingleThreadEventLoop {
>          long currentTimeNanos = System.nanoTime();
>          long selectDeadLineNanos = currentTimeNanos + delayNanos(currentTimeNanos);
>          for (;;) {
> -            long timeoutMillis = (selectDeadLineNanos - currentTimeNanos + 500000L)
/ 1000000L;
> +            long timeoutNanos = selectDeadLineNanos - currentTimeNanos + 500000L;
> +            long timeoutMillis =  timeoutNanos / 1000000L;
> +            System.out.printf("timeoutNanos: %d, timeoutMillis: %d | deadline: %d -
now: %d | hastask: %d\n",
> +                    timeoutNanos, timeoutMillis,
> +                    selectDeadLineNanos, currentTimeNanos, hasTasks() ? 1 : 0);
>              if (timeoutMillis <= 0) {
>                  if (selectCnt == 0) {
>                      int ready = Native.epollWait(epollFd.intValue(), events, 0);
> +                    System.out.printf("ready: %d\n", ready);
>                      if (ready > 0) {
>                          return ready;
>                      }
> {code}
> And this gives :
> {code}
> timeoutNanos: 1000500000, timeoutMillis: 1000 | deadline: 2001782341816510 - now: 2001781341816510
| hastask: 0
> timeoutNanos: 1000500000, timeoutMillis: 1000 | deadline: 2001782342087239 - now: 2001781342087239
| hastask: 0
> timeoutNanos: 1000500000, timeoutMillis: 1000 | deadline: 2001782342166947 - now: 2001781342166947
| hastask: 0
> timeoutNanos: 508459, timeoutMillis: 0 | deadline: 2001781342297987 - now: 2001781342289528
| hastask: 0
> ready: 0
> timeoutNanos: 508475, timeoutMillis: 0 | deadline: 2001781342357719 - now: 2001781342349244
| hastask: 0
> ready: 0
> timeoutNanos: 509327, timeoutMillis: 0 | deadline: 2001781342394822 - now: 2001781342385495
| hastask: 0
> ready: 0
> timeoutNanos: 509339, timeoutMillis: 0 | deadline: 2001781342430192 - now: 2001781342420853
| hastask: 0
> ready: 0
> timeoutNanos: 509510, timeoutMillis: 0 | deadline: 2001781342461588 - now: 2001781342452078
| hastask: 0
> ready: 0
> timeoutNanos: 509493, timeoutMillis: 0 | deadline: 2001781342495044 - now: 2001781342485551
| hastask: 0
> ready: 0
> {code}
> The nanosecond timeout all come from {{eventLoop.schedule(this, 10000, TimeUnit.NANOSECONDS);}}
in {{Message::Flusher}}.
> Knowing that, I'm not sure what would be best to do, and I have a hard time understanding
Message::Flusher, but to me it looks like trying to schedule less tasks would probably help
and I didn't think anything obvious that could be done with netty.
> Changing {{if (++runsWithNoWork > 5)}} to 2 seems to help a little bit, but that isn't
really significant.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org


Mime
View raw message