mxnet-dev mailing list archives

From Anirudh <anirudh2...@gmail.com>
Subject Re: Improved Exception Handling in MXNet Wiki
Date Sun, 21 Jan 2018 17:09:21 GMT
Thank you for your feedback, Mu and Eftiquar!
Let's say op A writes var V and op B reads V. If an exception is thrown
during execution of A, the callback for A will still be called explicitly.
This ensures that all the dependencies are cleared and that the variables
A writes to have their exception_ptr member set if needed. This prevents
a system hang.
I have modified the wiki to make this point clear.
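
As a rough illustration of the mechanism described above, here is a minimal
C++ sketch. The names Var, OnComplete, RunOp and RethrowIfFailed are
hypothetical and chosen for this example only; they are not the actual MXNet
engine API, and the real engine is asynchronous and multi-threaded.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for an engine variable; not MXNet's actual type.
struct Var {
  std::exception_ptr exception;  // set when an op writing this var failed
  int pending_writes = 0;        // writers that have not completed yet
};

// Hypothetical completion callback. It always runs, even when the op threw:
// it clears the write dependency and records the failure on each written var.
inline void OnComplete(const std::vector<Var*>& write_vars,
                       std::exception_ptr error) {
  for (Var* v : write_vars) {
    --v->pending_writes;               // dependency cleared, readers can proceed
    if (error) v->exception = error;   // remember the failure on the var
  }
}

// Runs an op body and guarantees that OnComplete is called exactly once.
inline void RunOp(const std::function<void()>& body,
                  const std::vector<Var*>& write_vars) {
  std::exception_ptr error;
  try {
    body();
  } catch (...) {
    error = std::current_exception();  // capture instead of letting it escape
  }
  OnComplete(write_vars, error);
}

// Hypothetical synchronization point: rethrows a failure stored on the var.
inline void RethrowIfFailed(const Var& v) {
  if (v.exception) std::rethrow_exception(v.exception);
}
```

With this shape, a reader of V is never blocked by a failed writer, and the
stored exception can be rethrown wherever the caller next waits on V.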

As Mu mentioned, one drawback of this approach is that operators which
depend on the failed operator will still execute and will produce wrong
results. I thought this behavior would be acceptable if we clearly document
that all operators depending on a failed operator produce unreliable
results.

If this is not acceptable for users, and a failure of an operator must stop
execution of all operators depending on it, then we need to use approach 2,
where we use an on-start callback which helps decide before an operator
executes whether to execute it or not. Please advise.
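
To make the difference concrete, approach 2 could look roughly like the C++
sketch below. Var, OnStart and RunOpIfInputsValid are hypothetical names used
only for this example, and forwarding the failure to the written variables is
an assumption on my part about how downstream operators would be skipped in
turn.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for an engine variable; not MXNet's actual type.
struct Var {
  std::exception_ptr exception;  // set when an op writing this var failed
};

// Hypothetical on-start callback: returns the first failure found among the
// variables the op reads, or a null exception_ptr if all inputs are valid.
inline std::exception_ptr OnStart(const std::vector<const Var*>& read_vars) {
  for (const Var* v : read_vars) {
    if (v->exception) return v->exception;
  }
  return nullptr;
}

// Marks every written var as failed so that dependents are skipped as well.
inline void Propagate(const std::vector<Var*>& write_vars,
                      std::exception_ptr error) {
  for (Var* v : write_vars) v->exception = error;
}

// Returns true if the op body was started, false if it was skipped because
// one of its inputs came from a failed operator.
inline bool RunOpIfInputsValid(const std::function<void()>& body,
                               const std::vector<const Var*>& read_vars,
                               const std::vector<Var*>& write_vars) {
  if (std::exception_ptr upstream = OnStart(read_vars)) {
    Propagate(write_vars, upstream);  // skip the op, forward the failure
    return false;
  }
  try {
    body();
  } catch (...) {
    Propagate(write_vars, std::current_exception());
  }
  return true;
}
```

The cost compared to approach 1 is an extra check before every operator
runs; the benefit is that no operator computes on the output of a failed one.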

Anirudh

On Sat, Jan 20, 2018 at 3:14 PM, Shaikh, Eftiquar <eftiquar@amazon.com>
wrote:

> A thread reading the result corresponding to an orphaned task can indeed
> cause a hang. Good catch.
> Exceptions as well as task results can be passed across threads using
> std::shared_future. If a task thread exited with an exception, the caller
> of std::future::get will get that exception, assuming the exiting thread
> stored it in the corresponding std::promise.
> No exception should escape the boundary of the thread that threw it.
> The top-level thread can then translate the exception into an error
> string and report back gracefully.
>
> Eftiquar
>
>
>
>
> On 1/19/18, 1:47 PM, "Li, Mu" <mli@amazon.com> wrote:
>
>     Very good document, thanks!
>
>     One issue with approach 1 is that resuming the operators after the
> failed one may cause errors and even a system hang. Say op A writes var V
> while op B reads V. Then B will not be executed if A fails, unless we
> clear their dependencies, but that will lead to wrong results as well.
>
>     Best
>     Mu
>
>     > On Jan 19, 2018, at 10:07 AM, Anirudh <anirudh2290@gmail.com> wrote:
>     >
>     > Hi,
>     >
>     > I have outlined the approach and proof of concept for Better Exception
>     > Handling in MXNet. Please provide feedback/comments/suggestions in the
>     > comments section of the wiki.
>     >
>     > https://cwiki.apache.org/confluence/display/MXNET/Improved+exception+handling+in+MXNet
>     >
>     >
>     > Note: Responses will be delayed till 01/22/2018.
>     >
>     > Anirudh
>
>
>
