zookeeper-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Bogdan Kanivets (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ZOOKEEPER-3046) testManyChildWatchersAutoReset is flaky
Date Thu, 19 Jul 2018 06:26:00 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-3046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548844#comment-16548844

Bogdan Kanivets commented on ZOOKEEPER-3046:



I've done some digging into the performance of this test by adding more system and time logging.

I've tested using fake [PR|https://github.com/apache/zookeeper/pull/538] to collect data
in apache CI system. I'm using fake 'testCreateManyChildren' that just creates 10k nodes.

Also, I've tested locally by running all tests in the loop and recording all logs until the
disk is full.

My hypothesis:

there is some other write-heavy test or build(maybe hadoop tests) that slows down creation
of 10K children.

Here fake test took 10s


Here fake took 3m 20s

 # I don't think there is a deadlock, because I'm logging 'stat' after every 1K nodes created
and it's pretty consistent.
 ** "Latency min/avg/max: 2/32/596" on slow
 ** "Latency min/avg/max: 0/0/25" on fast
 # Locally, when disk is getting full, I've seen this test slowing down.
 # I've tried logging vmstat, but not very helpful
 ** load average is bit higher on slow "load average: 12.54, 12.04, 10.95"
 ** "load average: 4.16, 4.56, 4.86" on fast
 # Since iostat isn't available, I've logged 'less /proc/diskstats'. Not very meaningful because
I'm logging at differently spaced timestamps right now. But one observation is that column
12 "I/Os currently in progress" is usually 1 or 2 on slow and usually 0 on fast.
 # Looking at column 9 of /proc/diskstats (writes completed) for 'sda' device and calculating
writes per ms using timestamps LOG.info:
 ** 0.18 writes/ms on slow
 ** 9.1 writes/ms on fast
 ** warning - this calculation shouldn't be trusted - I've done it only on one sample 

As a result I've tried to pin point some slow jenkins apache hosts, but wasn't successful.
I don't see the pattern. Also tried looking at jobs that run at the same time, again no luck.

Maybe the next steps can be:
 # Trying to run this test with memory mapped files
 # Somehow pinning the host just for zk builds
 # Getting access to the box and doing better profiling
 # I can try writing a better version of /proc/diskstats stats collector





> testManyChildWatchersAutoReset is flaky
> ---------------------------------------
>                 Key: ZOOKEEPER-3046
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3046
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: tests
>    Affects Versions: 3.5.3, 3.4.12
>            Reporter: Bogdan Kanivets
>            Assignee: Bogdan Kanivets
>            Priority: Minor
>              Labels: flaky, pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
> According to the [dashboard|https://builds.apache.org/job/ZooKeeper-Find-Flaky-Tests/lastSuccessfulBuild/artifact/report.html] testManyChildWatchersAutoReset
is flaky in 3.4 and 3.5
> [ZooKeeper_branch34_java10|https://builds.apache.org/job/ZooKeeper_branch34_java10//13]
> [ZooKeeper_branch35_java9|https://builds.apache.org/job/ZooKeeper_branch35_java9/253]
> Test times out and because of that ant doesn't capture any output.

This message was sent by Atlassian JIRA

View raw message