reef-dev mailing list archives

From "Tae-Geon Um (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (REEF-1729) Fix test job timeouts in Travis CI
Date Fri, 17 Mar 2017 11:20:41 GMT

    [ https://issues.apache.org/jira/browse/REEF-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929660#comment-15929660 ]

Tae-Geon Um edited comment on REEF-1729 at 3/17/17 11:20 AM:
-------------------------------------------------------------

[~motus] 

The test time increased after [PR#1174|https://github.com/apache/reef/pull/1174] was merged.
Before PR#1174 was merged, a test job took about 25 minutes ([Build#2163|https://travis-ci.org/apache/reef/builds/171245497]).
After PR#1174 was merged, it took up to 48 minutes ([Build#2175|https://travis-ci.org/apache/reef/builds/172205482]).
Because the build machines are sometimes slow, a job that long can then hit the time limit and time out.

I don't think the test job timeouts are caused by a runaway thread. Instead, they seem to be caused
by the use of {{[awaitUninterruptibly()|https://github.com/apache/reef/pull/1174/files#diff-569bb7fe6ce70facd5adb2e3a91f0f40R229]}},
which waits for a quiet period before the resources are completely released. We discussed this issue previously
in [REEF-1231|https://issues.apache.org/jira/browse/REEF-1231].

We can reduce the test running time by calling {{awaitUninterruptibly()}} only after closing and shutting
down all of the channels. I've created a PR for it: [PR#1268|https://github.com/apache/reef/pull/1268]
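
To illustrate why the order matters, here is a minimal, self-contained sketch. It does not use REEF or Netty code: {{FakeGroup}}, its {{shutdownGracefully()}} method, and the 100 ms quiet period are hypothetical stand-ins for a resource whose shutdown only completes after a quiet period. It is not the change in PR#1268, only the general idea of starting every shutdown first and waiting afterwards, so that the quiet periods overlap instead of adding up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ShutdownOrderSketch {

  /** Hypothetical resource: its shutdown future completes only after a quiet period. */
  static final class FakeGroup {
    private final ScheduledExecutorService timer;
    private final long quietMillis;

    FakeGroup(final ScheduledExecutorService timer, final long quietMillis) {
      this.timer = timer;
      this.quietMillis = quietMillis;
    }

    /** Starts the shutdown and returns immediately; the future completes after the quiet period. */
    CompletableFuture<Void> shutdownGracefully() {
      final CompletableFuture<Void> done = new CompletableFuture<>();
      timer.schedule(() -> done.complete(null), quietMillis, TimeUnit.MILLISECONDS);
      return done;
    }
  }

  /** Waits right after each shutdown call, so the quiet periods add up. */
  static long shutdownOneByOne(final List<FakeGroup> groups) {
    final long start = System.nanoTime();
    for (final FakeGroup group : groups) {
      group.shutdownGracefully().join();
    }
    return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
  }

  /** Starts every shutdown first, then waits, so the quiet periods overlap. */
  static long shutdownThenAwaitAll(final List<FakeGroup> groups) {
    final long start = System.nanoTime();
    final List<CompletableFuture<Void>> pending = new ArrayList<>();
    for (final FakeGroup group : groups) {
      pending.add(group.shutdownGracefully());
    }
    for (final CompletableFuture<Void> future : pending) {
      future.join();
    }
    return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
  }

  /** Returns {one-by-one millis, start-all-then-wait millis} for the given setup. */
  static long[] measure(final int groupCount, final long quietMillis) {
    final ScheduledExecutorService timer = Executors.newScheduledThreadPool(groupCount);
    try {
      final List<FakeGroup> groups = new ArrayList<>();
      for (int i = 0; i < groupCount; i++) {
        groups.add(new FakeGroup(timer, quietMillis));
      }
      return new long[] {shutdownOneByOne(groups), shutdownThenAwaitAll(groups)};
    } finally {
      timer.shutdownNow();
    }
  }

  public static void main(final String[] args) {
    final long[] elapsed = measure(3, 100);
    System.out.println("wait after each shutdown: " + elapsed[0] + " ms");
    System.out.println("wait after all shutdowns: " + elapsed[1] + " ms");
  }
}
```

With three groups, the first strategy should take roughly three quiet periods and the second roughly one. The exact numbers depend on the machine.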


was (Author: taegeonum):
[~motus] 

The test time increased after [PR#1174|https://github.com/apache/reef/pull/1174] was merged.
Before PR#1174 was merged, a test job took about 25 minutes ([Build#2163|https://travis-ci.org/apache/reef/builds/171245497]).
After PR#1174 was merged, it took up to 48 minutes ([Build#2175|https://travis-ci.org/apache/reef/builds/172205482]).
Because the build machines are sometimes slow, a job that long can then hit the time limit and time out.

I don't think the test job timeouts are caused by a runaway thread. Instead, they seem to be caused
by the use of {{[awaitUninterruptibly()|https://github.com/apache/reef/pull/1174/files#diff-569bb7fe6ce70facd5adb2e3a91f0f40R229]}},
which waits for a quiet period before the resources are completely released. We discussed this issue previously
in [REEF-1231|https://issues.apache.org/jira/browse/REEF-1231].

I think it would be good to remove the {{awaitUninterruptibly()}} calls that were added in PR#1174.
Even if we delete that code, {{.close()}} is still idempotent. If you are fine with
this change, I will create a PR for it :)

> Fix test job timeouts in Travis CI
> ----------------------------------
>
>                 Key: REEF-1729
>                 URL: https://issues.apache.org/jira/browse/REEF-1729
>             Project: REEF
>          Issue Type: Bug
>            Reporter: Mariia Mykhailova
>            Assignee: Sergiy Matusevych
>
> Recent changes in the way we're closing threads in Java code during REEF driver shutdown
> seem to have introduced a bug in this area. We observe transient test job timeouts in
> [Travis CI|https://travis-ci.org/apache/reef/builds/]: typically one test job takes 39-41 minutes,
> the limit on job duration is 50 minutes, and we're seeing test jobs hitting the limit and
> timing out. There is no test failure reported in such cases, so I suspect there is some runaway,
> unaccounted-for thread or an entire test which fails to complete properly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
