mesos-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jie Yu" <yujie....@gmail.com>
Subject Review Request 14686: Catch-up Replicated Log 2: Fixed a Race Condition Caused by Passing Process Wrapper Pointers.
Date Wed, 16 Oct 2013 21:49:24 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/14686/
-----------------------------------------------------------

Review request for mesos, Benjamin Hindman, Ben Mahler, and Vinod Kone.


Repository: mesos-git


Description
-------

This is the second patch of a series of patches that implement a catch-up mechanism for replicated
log. This patch fixed a race condition caused by passing process wrapper pointers (e.g. Network,
Replica, etc.).

In our code base, one commonly used routine is to use a wrapper class to wrap a libprocess
so that we don't need to worry about whether we should directly call the function or should
use a dispatch. For example:

class FooProcess : public Process<FooProcess> {
  int foo();
};

class Foo {
  Future<int> foo() { return dispatch(process, &FooProcess::foo); }
  ...
  FooProcess* process;
};

If multiple entities want to share a libprocess (e.g. FooProcess), currently, we pass a pointer
of its wrapper (e.g. Foo*) to the users. However, this is problematic in some cases. The main
reason is that we don't know when it is safe to destroy the libprocess (FooProcess) and delete
the wrapper (Foo) if a pointer of this wrapper has been passed to some other entities in the
system and we don't know whether they will use it or not later.

There is a race condition in our first patch. Here is the buggy interleaving:

T1                                   T2

appending.discard();
                                     ExplicitPromiseProcess::initialize() {
onDiscarded() {
  terminate ExplicitPromiseProcess
}

delete network;
                      
                                       ...
                                       network->broadcast(protocol::promise, request)
                                     }

(We added a new test CoordinatorTest.FillTimeout which can reproduce this bug and will segfault
without this patch.)


To fix this problem, we introduce a concept called ProcessHandle to generalize process wrappers
(see process.hpp). It internally maintains a shared pointer to the underlying libprocess.
It is copyable. Multiple copies of the same handle share the same underlying libprocess. When
the last copy is deleted, we will terminate the underlying process and delete the libprocess.
If a user wants to early terminate the libprocess, he can explicitly call the terminate()
function.

Now, creating a wrapper for a libprocess is simply as that:

class Foo : public ProcessHandle<FooProcess> { ... };

This change won't break the existing code (i.e., if you use the old way of passing wrapper
pointers, this change is like a noop).

This is a joint work with Yan Xu.


Diffs
-----

  3rdparty/libprocess/include/process/process.hpp b502324 
  src/Makefile.am a2d8242 
  src/log/catchup.hpp PRE-CREATION 
  src/log/catchup.cpp PRE-CREATION 
  src/log/consensus.hpp PRE-CREATION 
  src/log/consensus.cpp PRE-CREATION 
  src/log/coordinator.hpp 3f6fb7c 
  src/log/coordinator.cpp 6e6466f 
  src/log/log.hpp 77edc7a 
  src/log/network.hpp d34cf78 
  src/log/replica.hpp d1f5ead 
  src/log/replica.cpp 59a6ff3 
  src/log/zookeeper.hpp PRE-CREATION 
  src/log/zookeeper.cpp PRE-CREATION 
  src/tests/log_tests.cpp ff5f86c 

Diff: https://reviews.apache.org/r/14686/diff/


Testing
-------

bin/mesos-tests.sh --gtest_filter=CoordinatorTest*:ReplicaTest*:LogTest* --gtest_repeat=100


Thanks,

Jie Yu


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message