reef-dev mailing list archives

From "Sergiy Matusevych (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (REEF-1729) Fix test job timeouts in Travis CI
Date Thu, 09 Mar 2017 00:00:41 GMT

    [ https://issues.apache.org/jira/browse/REEF-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15902195#comment-15902195 ]

Sergiy Matusevych edited comment on REEF-1729 at 3/9/17 12:00 AM:
------------------------------------------------------------------

I suspect that this and a few other issues are all caused by the new cleanup code required
for [REEF-1561: REEF as a Library|REEF-1561] feature.

We need to review the cleanup process, get rid of potential race conditions, and make sure
that all resources (threads, files, network connections and such) are properly closed and/or
deleted at the end of the REEF job.
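
As a rough illustration of closing one kind of resource deterministically, the sketch below shuts
down a thread pool and waits for its workers to finish. It is a generic {{java.util.concurrent}}
pattern, not REEF's actual cleanup code; the names {{ShutdownSketch}} and {{shutdownAndWait}} are
hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of REEF: a generic executor shutdown pattern.
final class ShutdownSketch {

  private ShutdownSketch() {
  }

  /**
   * Stops accepting new tasks, waits up to timeoutMs for running tasks,
   * then interrupts whatever is still running and waits once more.
   *
   * @return true if the executor has terminated, false otherwise.
   */
  static boolean shutdownAndWait(final ExecutorService executor, final long timeoutMs) {
    executor.shutdown(); // previously submitted tasks still run; new ones are rejected
    try {
      if (executor.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS)) {
        return true;
      }
      executor.shutdownNow(); // interrupt the stragglers
      return executor.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (final InterruptedException e) {
      executor.shutdownNow();
      Thread.currentThread().interrupt(); // preserve the caller's interrupt status
      return false;
    }
  }
}
```

A {{false}} return value would point at a task that ignores interruption, which is one way a worker
thread can outlive the job.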

The ultimate indicator of successful cleanup implementation would be the completion of [REEF-1715:
Remove System.exit() at the end of the REEF launcher|REEF-1715].

Other issues that might be related to the cleanup process are:
   * [REEF-1729] - Fix test job timeouts in Travis CI
   * [REEF-1726] - Close message dispatcher on the evaluator manager shutdown
   * [REEF-1715] - Remove {{System.exit()}} at the end of the REEF launcher
   * [REEF-1668] - Intermittent failures of {{EvaulatorCloseTest}}
   * [REEF-1661] - {{RejectedExecutionException}} thrown when closing the acceptor in {{NettyMessageTransport}}

[~shouhengyi], [~taegeonum], it would be great if you guys could help me with any of these
issues. You can start with the {{HelloREEF}} and {{HelloREEFYarn}} examples in Java and see
what threads are still running at the end of each process (Client, Driver, and the Evaluators).
Ideally, we should have only the {{main}} thread left - then we can go ahead and remove the
{{System.exit()}} call!
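
One way to run this check is a small helper that lists the non-daemon threads still alive just
before a process returns from {{main}}. The sketch below is illustrative only: the class name
{{LiveThreadReport}} and the method {{lingeringThreads}} are hypothetical and not part of REEF.
It relies only on the standard {{Thread.getAllStackTraces()}} call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of REEF: reports threads that would keep the JVM alive.
final class LiveThreadReport {

  private LiveThreadReport() {
  }

  /** Returns all live non-daemon threads except the calling thread. */
  static List<Thread> lingeringThreads() {
    final List<Thread> result = new ArrayList<>();
    for (final Thread thread : Thread.getAllStackTraces().keySet()) {
      if (thread.isAlive() && !thread.isDaemon() && thread != Thread.currentThread()) {
        result.add(thread);
      }
    }
    return result;
  }

  public static void main(final String[] args) {
    // In a real check this would be the last statement of the Client or Driver main().
    for (final Thread thread : lingeringThreads()) {
      System.err.println("Still running: " + thread.getName()
          + " (group: " + thread.getThreadGroup() + ")");
    }
  }
}
```

Daemon threads are skipped because they do not prevent the JVM from exiting; any thread this
helper reports is a candidate for what currently forces the {{System.exit()}} call.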


was (Author: motus):
I suspect that this and a few other issues are all caused by the new cleanup code required
for [REEF-1561: REEF as a Library|REEF-1561] feature.

We need to review the cleanup process, get rid of potential race conditions, and make sure
that all resources (threads, files, network connections and such) are properly closed and/or
deleted at the end of the REEF job.

The ultimate indicator of successful cleanup implementation would be the completion of [REEF-1715:
Remove System.exit() at the end of the REEF launcher|REEF-1715].

Other issues that might be related to the cleanup process are:
   * [REEF-1729] - Fix test job timeouts in Travis CI
   * [REEF-1726] - Close message dispatcher on the evaluator manager shutdown
   * [REEF-1715] - Remove {{System.exit()}} at the end of the REEF launcher
   * [REEF-1668] - Intermittent failures of {{EvaulatorCloseTest}}
   * [REEF-1661] - {{RejectedExecutionException}} thrown when closing the acceptor in {{NettyMessageTransport}}

[~shouhengyi], [~taegeonum] It would be great if you guys could help me with any of these
issues. You can start with the {{HelloREEF}} and {{HelloREEFYarn}} examples in Java and see
what threads are still running at the end of each process (Client, Driver, and the Evaluators).
Ideally, we should have only the {{main}} thread left - then we can go ahead and remove the
{{System.exit()}} call!

> Fix test job timeouts in Travis CI
> ----------------------------------
>
>                 Key: REEF-1729
>                 URL: https://issues.apache.org/jira/browse/REEF-1729
>             Project: REEF
>          Issue Type: Bug
>            Reporter: Mariia Mykhailova
>            Assignee: Sergiy Matusevych
>
> Recent changes in the way we're closing threads in Java code during REEF driver shutdown
> seem to have introduced a bug in this area. We observe transient test job timeouts in
> [Travis CI|https://travis-ci.org/apache/reef/builds/]: typically one test job takes 39-41 minutes,
> the limit on job duration is 50 minutes, and we're seeing test jobs hit the limit and
> time out. No test failure is reported in such cases, so I suspect there is either a runaway,
> unaccounted-for thread or an entire test that fails to complete properly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
