reef-dev mailing list archives

From "Sergiy Matusevych (JIRA)" <>
Subject [jira] [Commented] (REEF-1729) Fix test job timeouts in Travis CI
Date Sun, 12 Mar 2017 02:06:06 GMT


Sergiy Matusevych commented on REEF-1729:

I think I'll add more details about the issue.

The cleanup problem is the flip side of initialization via dependency injection. Since we do
not explicitly control the order in which the Tang Injector creates objects, it is hard for
us to control the order in which their cleanup routines are invoked.

Prior to the [REEF-1561: REEF as a Library|REEF-1561] effort, cleanup was never a problem,
because both the Driver and the Evaluator processes were completely under control of the REEF
framework, and we could rely on the {{System.exit()}} call to close all remaining resources.
As a result, many REEF components did not have any cleanup code at all. Now that REEF must
coexist with other applications (e.g. Spark) in the same JVM, it is critical to clean up
properly before returning control from REEF to the caller app.

Java has a universal interface, {{AutoCloseable}}, that provides a single method, {{void AutoCloseable.close()}}.
Every REEF class that requires some cleanup (or may require it in the future) should implement
this interface. Since it is hard for us to control the order in which the {{.close()}} methods
will be called, we must agree on certain rules about their behavior.

   1. {{.close()}} methods must be *idempotent*. That is, it should be OK to call the same
method several times; subsequent calls must have no effect.
   2. {{.close()}} methods should *never throw*. It is a well-known design principle in C++
that destructors should never raise exceptions. We should follow this logic in REEF. Otherwise,
it would be very hard to deal with partial cleanups.

The two principles above also simplify the implementation: we do not have to worry about
race conditions (i.e. the order and number of times each {{.close()}} method is called) or
about error handling, since each {{.close()}} method is self-contained and never propagates
exceptions outside of its scope.
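
To make the two rules concrete, here is a minimal sketch of a {{.close()}} method that is both idempotent and non-throwing. The class name {{SafeResource}}, the {{Runnable}} cleanup hook, and the {{isClosed()}} helper are hypothetical and are not part of REEF; a real component would release its own threads, sockets, or handlers instead. The sketch relies on the fact that {{AutoCloseable.close()}} is declared with {{throws Exception}}, and an implementing class may drop that clause.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical example class; not part of REEF. */
final class SafeResource implements AutoCloseable {

  private static final Logger LOG = Logger.getLogger(SafeResource.class.getName());

  /** Flips to true on the first close() call and stays true. */
  private final AtomicBoolean closed = new AtomicBoolean(false);

  /** Stand-in for whatever cleanup a real component needs (assumption). */
  private final Runnable cleanup;

  SafeResource(final Runnable cleanup) {
    this.cleanup = cleanup;
  }

  boolean isClosed() {
    return closed.get();
  }

  /** Declares no exceptions, so callers never need a catch block. */
  @Override
  public void close() {
    // Rule 1 (idempotent): only the first caller gets past this check,
    // even if several threads call close() at the same time.
    if (!closed.compareAndSet(false, true)) {
      return;
    }
    // Rule 2 (never throw): log the failure and swallow it, so that one
    // failing component cannot interrupt the cleanup of the others.
    try {
      cleanup.run();
    } catch (final Exception ex) {
      LOG.log(Level.WARNING, "Cleanup failed; continuing shutdown", ex);
    }
  }
}
```

Because this {{.close()}} declares no checked exceptions, such an object can be used in a try-with-resources statement without a {{catch}} clause, and calling {{.close()}} again from another shutdown path has no effect. Note that the flag is set before the cleanup runs, so a failed cleanup is not retried; whether that is the right trade-off depends on the component.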

> Fix test job timeouts in Travis CI
> ----------------------------------
>                 Key: REEF-1729
>                 URL:
>             Project: REEF
>          Issue Type: Bug
>            Reporter: Mariia Mykhailova
>            Assignee: Sergiy Matusevych
> Recent changes in the way we're closing threads in Java code during REEF driver shutdown
seem to have introduced a bug in this area. We observe transient test job timeouts in [Travis
CI|]: typically one test job takes 39-41 minutes,
the limit on job duration is 50 minutes, and we're seeing test jobs hitting the limit and
timing out. There is no test failure reported in such cases, so I suspect there is some runaway,
unaccounted-for thread or an entire test which fails to complete properly.

This message was sent by Atlassian JIRA
