hama-dev mailing list archives

From "Edward J. Yoon" <edwardy...@apache.org>
Subject Re: [DISCUSS] Fault tolerant BSP job
Date Thu, 10 Apr 2014 05:24:19 GMT
As you know, we still do NOT support FT job processing, and
there's no documentation. I might be wrong, but we can *simply* restart
all tasks from the last checkpoint files on HDFS.
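
To make the "restart from the last checkpoint" idea concrete, here is a
minimal sketch. It assumes each task writes one checkpoint per superstep
and that the checkpointed superstep numbers can be listed per task. The
class and method names are hypothetical and not part of Hama, and no HDFS
access is shown.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical helper for illustration only; not part of Hama. */
class CheckpointRestartSketch {

  /**
   * Returns the newest superstep for which every task has a checkpoint,
   * or -1 if no such superstep exists (restart the job from scratch).
   */
  static long lastCommonCheckpoint(Map<String, List<Long>> checkpointsByTask) {
    if (checkpointsByTask.isEmpty()) {
      return -1L;
    }
    TreeSet<Long> common = null;
    for (List<Long> supersteps : checkpointsByTask.values()) {
      if (common == null) {
        common = new TreeSet<>(supersteps);
      } else {
        common.retainAll(supersteps); // keep supersteps seen by all tasks so far
      }
    }
    return common.isEmpty() ? -1L : common.last();
  }

  public static void main(String[] args) {
    // Assumed listing of checkpointed supersteps per task.
    Map<String, List<Long>> checkpoints = new HashMap<>();
    checkpoints.put("task_0", Arrays.asList(2L, 4L, 6L));
    checkpoints.put("task_1", Arrays.asList(2L, 4L));
    System.out.println("restart all tasks from superstep "
        + lastCommonCheckpoint(checkpoints));
  }
}
```

Picking the newest superstep that every task has checkpointed is itself an
assumption: restarting all tasks needs a state that is consistent across
all of them.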

It has been many years since we discussed FT and the superstep
API, and the main contributors to FT job processing are currently
inactive.

May I close all old issue tickets? Let's just code it.



On Thu, Apr 10, 2014 at 2:31 AM, Chia-Hung Lin <clin4j@googlemail.com> wrote:
> That's why I proposed to use the Superstep API instead, though I prefer
> the plain bsp function. Unless we want to instrument the source code,
> which I believe is not what we, including users, want.
>
> With the Superstep API we can resume from the latest checkpointed
> message (the newly refactored code should be based on this as well),
> under some preconditions.
>
> Alternatively, we can implement our own code (not in Java, or probably
> in Java 8) to perform checkpointing, but that would take a very long
> time to accomplish. I would put that issue on the future roadmap
> because personally I use the plain bsp function instead of Superstep.
>
>
>
> On 9 April 2014 23:56, Suraj Menon <surajsmenon@apache.org> wrote:
>> I don't like my patch in HAMA-639 myself, even though I believe it satisfies
>> all the mentioned requirements. The usage of the superstep chaining API
>> implementation in the patch is too complicated. A superstep here is like a
>> transformation function you define on an RDD in Spark. So if you look into
>> the FT design of Spark, on failure, they rerun the operations on the RDD to
>> get to the current state. This is similar to what we have in mind using
>> checkpointing. The challenge is in getting the same messages replayed to a
>> newly spawned task on checkpointed data. If you don't use the Superstep (or
>> any other abstraction representing a function), you cannot start processing
>> from the line of code where the failure occurred. (Java does not support
>> goto line number.)
>>
>> -Suraj
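
The point above about needing a function-like abstraction can be sketched
as follows. This is only an illustration under simplifying assumptions:
the state is a single number, messages are ignored, and every name here
is hypothetical rather than Hama API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical illustration; these names are not Hama API. */
class ResumableSupersteps {

  /** Runs the steps from index start onward and returns the final state. */
  static long runFrom(List<UnaryOperator<Long>> steps, int start, long state) {
    for (int i = start; i < steps.size(); i++) {
      state = steps.get(i).apply(state);
      // A real framework would barrier-sync (and possibly checkpoint) here.
    }
    return state;
  }

  /** Three toy supersteps, each a plain function of the previous state. */
  static List<UnaryOperator<Long>> exampleSteps() {
    List<UnaryOperator<Long>> steps = new ArrayList<>();
    steps.add(x -> x + 1);
    steps.add(x -> x * 10);
    steps.add(x -> x - 3);
    return steps;
  }

  public static void main(String[] args) {
    List<UnaryOperator<Long>> steps = exampleSteps();
    long uninterrupted = runFrom(steps, 0, 1L);

    // Pretend the task failed after the second step and that the state
    // reached at that point had been checkpointed.
    long checkpointed = steps.get(1).apply(steps.get(0).apply(1L));
    long resumed = runFrom(steps, 2, checkpointed);

    System.out.println(uninterrupted + " == " + resumed);
  }
}
```

Because each step is addressable by index, a restarted task can re-enter
at the step after the checkpoint. A single monolithic bsp() body offers no
such entry point without instrumenting the code.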
>>
>>
>> On Wed, Apr 9, 2014 at 7:29 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>>
>>> I just found this: https://issues.apache.org/jira/browse/HAMA-503 and
>>> HAMA-639.
>>>
>>> Do you still think the superstep API is essential for checkpoint/recovery?
>>> If not, we can drop it. I don't think it's a good idea.
>>>
>>> On Wed, Apr 9, 2014 at 7:43 PM, Chia-Hung Lin <clin4j@googlemail.com> wrote:
>>> > Not very sure if we are on the same page. And sorry, I am not very
>>> > familiar with the Superstep implementation.
>>> >
>>> > I assume that the traditional bsp model means the original bsp interface,
>>> > where there is a bsp function and the user can freely call peer.sync()
>>> > and other methods:
>>> >
>>> > .... bsp(BSPPeer ... peer) {
>>> >     // whatever computation
>>> >     peer.sync();
>>> > }
>>> >
>>> > And the superstep style is the one with the Superstep abstract class.
>>> >
>>> > If this is the case, SuperstepBSP.java already calls sync, as shown
>>> > below, outside each Superstep.compute(). So even though
>>> > SuperstepPiEstimator doesn't call the sync() method, a barrier sync will
>>> > still be executed, because each Superstep is viewed as a superstep in
>>> > the original BSP definition.
>>> >
>>> >   @Override
>>> >   public void bsp(BSPPeer<K1, V1, K2, V2, M> peer) throws IOException,
>>> >       SyncException, InterruptedException {
>>> >     for (int index = startSuperstep; index < supersteps.length; index++) {
>>> >       Superstep<K1, V1, K2, V2, M> superstep = supersteps[index];
>>> >       superstep.compute(peer);
>>> >       if (superstep.haltComputation(peer)) {
>>> >         break;
>>> >       }
>>> >       peer.sync();
>>> >       startSuperstep = 0;
>>> >     }
>>> >   }
>>> >
>>> > Within the Superstep.compute(), if sync is called again, I would think
>>> > that another barrier sync will be executed.
>>> >
>>> > SuperstepBSP.java
>>> >
>>> > for(...) {
>>> >   superstep.compute() -> { // in compute method
>>> >     ...
>>> >     peer.sync()
>>> >   }
>>> >   ...
>>> >   peer.sync()
>>> > }
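
The double-barrier effect described above can be checked with a small
simulation. CountingPeer is a hypothetical stand-in that only counts
sync() calls; it is not Hama's BSPPeer, and the loop is a simplified
version of the quoted driver loop (the haltComputation check is omitted).

```java
import java.util.function.Consumer;

/** Hypothetical stand-in for a BSP peer that only counts barrier syncs. */
class CountingPeer {
  int syncs = 0;

  void sync() {
    syncs++;
  }
}

/** Hypothetical driver for illustration; not Hama code. */
class DoubleSyncSketch {

  /** Runs compute once per superstep, followed by one driver-issued sync. */
  static int run(int supersteps, Consumer<CountingPeer> compute) {
    CountingPeer peer = new CountingPeer();
    for (int i = 0; i < supersteps; i++) {
      compute.accept(peer);
      peer.sync(); // sync issued by the driver loop, outside compute()
    }
    return peer.syncs;
  }

  public static void main(String[] args) {
    // compute() that never calls sync itself
    System.out.println(run(3, peer -> { }));
    // compute() that also calls sync once
    System.out.println(run(3, CountingPeer::sync));
  }
}
```

In this model a compute() that syncs on its own doubles the number of
barriers per superstep. If every sync may also trigger a checkpoint, the
checkpoint count would grow in the same way.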
>>> >
>>> > IIRC each call to sync may trigger the checkpoint (no recovery) method,
>>> > which serializes messages to HDFS.
>>> >
>>> > For SerializePrinting, the following code snippet may be moved
>>> >
>>> > for (String otherPeer : bspPeer.getAllPeerNames()) {
>>> >     bspPeer.send(otherPeer, new IntegerMessage(bspPeer.getPeerName(), i));
>>> > }
>>> >
>>> > to Superstep.compute()
>>> >
>>> > And the outer for loop is what is programmed in SuperstepBSP.java
>>> >
>>> > for (int i = 0; i < NUM_SUPERSTEPS; i++) {
>>> >     // code that should be moved to Superstep.compute()
>>> > }
>>> > bspPeer.sync();
>>> >
>>> >
>>> >
>>> > On 9 April 2014 16:17, Edward J. Yoon <edwardyoon@apache.org> wrote:
>>> >> As you can see here[1], the sync() method is never called, and the
>>> >> classes of all supersteps need to be declared within the Job
>>> >> configuration. Therefore, I thought it's similar to the Pregel style on
>>> >> the BSP model. It's quite different from the legacy model in my eyes.
>>> >>
>>> >> According to HAMA-505, the superstep API seems to be used for FT job
>>> >> processing (I haven't read it closely yet). Right? Here, I have some
>>> >> questions. What happens if I call the sync() method within the compute()
>>> >> method? In that case, does the framework guarantee checkpoint/recovery?
>>> >> And how can I implement http://wiki.apache.org/hama/SerializePrinting
>>> >> using the superstep API?
>>> >>
>>> >>> What's the difference between pure BSP and FT BSP? Any concrete example?
>>> >>
>>> >> I meant the traditional BSP programming model.
>>> >>
>>> >> 1. http://svn.apache.org/repos/asf/hama/trunk/examples/src/main/java/org/apache/hama/examples/SuperstepPiEstimator.java
>>> >>
>>> >> On Wed, Apr 9, 2014 at 4:25 PM, Chia-Hung Lin <clin4j@googlemail.com> wrote:
>>> >>> Sorry, I don't catch the point.
>>> >>>
>>> >>> What's the difference between pure BSP and FT BSP? Any concrete example?
>>> >>>
>>> >>>
>>> >>> On 9 April 2014 08:29, Edward J. Yoon <edwardyoon@apache.org> wrote:
>>> >>>> In my eyes, SuperstepPiEstimator[1] looks like a totally new programming
>>> >>>> model, very similar to Pregel.
>>> >>>>
>>> >>>> I personally would like to suggest that we provide both the pure BSP and
>>> >>>> the fault-tolerant BSP model, instead of replacing one with the other.
>>> >>>>
>>> >>>> 1. http://svn.apache.org/repos/asf/hama/trunk/examples/src/main/java/org/apache/hama/examples/SuperstepPiEstimator.java
>>> >>>>
>>> >>>> --
>>> >>>> Edward J. Yoon (@eddieyoon)
>>> >>>> Chief Executive Officer
>>> >>>> DataSayer, Inc.
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Edward J. Yoon (@eddieyoon)
>>> >> CEO at DataSayer Co., Ltd.
>>>
>>>
>>>
>>> --
>>> Edward J. Yoon (@eddieyoon)
>>> Chief Executive Officer
>>> DataSayer Co., Ltd.
>>>



-- 
Edward J. Yoon (@eddieyoon)
Chief Executive Officer
DataSayer Co., Ltd.
