mesos-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Connor Doyle" <con...@mesosphere.io>
Subject Re: Review Request 25614: Safer handling of futures in JNI state abstraction
Date Mon, 06 Oct 2014 23:26:24 GMT


> On Sept. 15, 2014, 9:17 p.m., Jie Yu wrote:
> > This more looks like a bug somewhere. Let's try to find out the true root cause
first.
> 
> Niklas Nielsen wrote:
>     True - but throwing the result away from await() seems silly/unsafe too (rather on
relying on the await to have completed or wait forever). Our investigation ended in https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/latch.cpp#L58
(as explained in the RR description).
>     We haven't been able to realiably reproduce the fault to debug it - Connor, can you
fill in on when it happened in Marathon?
> 
> Connor Doyle wrote:
>     Sure, will try to reproduce with more verbose logging on a fresh cluster tomorrow.
> 
> Ben Mahler wrote:
>     The semantics of Future::await() are such that if no Duration is provided, then it
will never return false:
>     
>     ```
>       f.await(); // Always returns true.
>       // Now the future should not be pending!
>     ```
>     
>     It should only return false as the result of a timeout.
>     
>     That's why Jie is saying we need to understand the bug, it's not even clear whether
await returned false (which should not occur) vs. returned true but the CHECK_READY still
saw the future as pending.
> 
> Niklas Nielsen wrote:
>     Connor, do you want to close this issue for now and continue the investigation in
the JIRA ticket? We can reopen if this is indeed the issue.
>     
>     Thanks

Yes, that seems appropriate as this patch seems unlikely to be accepted in its current state.


- Connor


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25614/#review53409
-----------------------------------------------------------


On Sept. 15, 2014, 8:54 p.m., Connor Doyle wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/25614/
> -----------------------------------------------------------
> 
> (Updated Sept. 15, 2014, 8:54 p.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Bugs: MESOS-1795
>     https://issues.apache.org/jira/browse/MESOS-1795
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> ## OVERVIEW
> 
> This change adds a measure of safety for calls to `Future<T>::await()`
> without a timeout from within the JNI interface to the state
> abstraction.
> 
> ## MOTIVATION
> 
> Observed the following log output prior to a crash of the Marathon scheduler:
> 
> ```
> Sep 12 23:46:01 highly-available-457-540 marathon[11494]: F0912 23:46:01.771927 11532
org_apache_mesos_state_AbstractState.cpp:145] CHECK_READY(*future): is PENDING 
> Sep 12 23:46:01 highly-available-457-540 marathon[11494]: *** Check failure stack trace:
***
> Sep 12 23:46:01 highly-available-457-540 marathon[11494]:     @     0x7febc2663a2d  google::LogMessage::Fail()
> Sep 12 23:46:01 highly-available-457-540 marathon[11494]:     @     0x7febc26657e3  google::LogMessage::SendToLog()
> Sep 12 23:46:01 highly-available-457-540 marathon[11494]:     @     0x7febc2663648  google::LogMessage::Flush()
> Sep 12 23:46:01 highly-available-457-540 marathon[11494]:     @     0x7febc266603e  google::LogMessageFatal::~LogMessageFatal()
> Sep 12 23:46:01 highly-available-457-540 marathon[11494]:     @     0x7febc26588a3  Java_org_apache_mesos_state_AbstractState__1_1fetch_1get
> Sep 12 23:46:01 highly-available-457-540 marathon[11494]:     @     0x7febcd107d98  (unknown)
> ```
> _Listing 1: Crash log output._
> 
> ## CHANGES IN DETAIL
> 
> The in-line comments in `Latch::await(const Duration& duration)`, mention
> that it's possible for the call to return before the underlying process
> has completed for two reasons:
> 
> ```
> ...
> process::wait(pid, duration); // Explict to disambiguate.
> // It's possible that we failed to wait because:
> //   (1) Our process has already terminated.
> //   (2) We timed out (i.e., duration was not "infinite").
> ...
> ```
> _Listing 2: Excerpt from 3rdparty/src/libprocess/latch.cpp._
> 
> Call sites within `org_apache_mesos_state_AbstractState.cpp`
> that formerly discarded the return value now throw a
> `java.concurrent.ExecutionException` if the await fails.  I chose to throw
> this instead of a `java.concurrent.TimeoutException` since the expectation
> from the caller's perspective is to block forever pending a ready state.
> 
> 
> Diffs
> -----
> 
>   src/java/jni/org_apache_mesos_state_AbstractState.cpp 1accc8a 
> 
> Diff: https://reviews.apache.org/r/25614/diff/
> 
> 
> Testing
> -------
> 
> - make
> - make check
> 
> 
> Thanks,
> 
> Connor Doyle
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message