hama-dev mailing list archives

From Chia-Hung Lin <cli...@googlemail.com>
Subject Re: [DISCUSS] Fault tolerant BSP job
Date Thu, 10 Apr 2014 08:15:39 GMT
In that case I suppose we can simply revert Superstep to the original
plain bsp function, and just sideline any issues related to FT for the
moment.


On 10 April 2014 13:24, Edward J. Yoon <edwardyoon@apache.org> wrote:
> As you know, we still do NOT support FT job processing, and
> there's no documentation. I might be wrong, but we can *simply* restart
> all tasks from the last checkpoint files on HDFS.
>
> It has been many years since we discussed FT and the superstep
> API, and the main contributors to FT job processing are currently
> inactive.
>
> May I close all old issue tickets? Let's just code it.
>
>
>
> On Thu, Apr 10, 2014 at 2:31 AM, Chia-Hung Lin <clin4j@googlemail.com> wrote:
>> That's why I proposed to use the Superstep API instead, though I prefer
>> the plain bsp function. Unless we want to instrument the source code,
>> which I believe is not what we, including users, want.
>>
>> With the Superstep API we can resume from the latest checkpointed
>> message (the newly refactored code should be based on this as well),
>> under some preconditions.
>>
>> Alternatively, we can implement our own code (not in Java, or probably in
>> Java 8) to perform checkpointing, but that would take a very long time to
>> accomplish. I would put that issue on the future
>> roadmap because personally I prefer the plain bsp function over
>> Superstep.
>>
>>
>> On 9 April 2014 23:56, Suraj Menon <surajsmenon@apache.org> wrote:
>>> I don't like my patch in HAMA-639 myself, even though I believe it satisfies
>>> all the mentioned requirements. The usage of superstep chaining API
>>> implementation in the patch is too complicated. A superstep here is like a
>>> transformation function you define on an RDD in Spark. So if you look into
>>> FT design of Spark, on failure, they rerun the operations on the RDD to get
>>> to the current state. This is similar to what we have in mind using
>>> checkpointing. The challenge is in getting the same messages replayed to
>>> a newly spawned task on checkpointed data. If you don't use the Superstep (or
>>> any other abstraction representing a function) you cannot start processing
>>> from a line of code where the failure occurred. (Java does not support goto
>>> line number.)
>>>
>>> -Suraj
>>>
>>>
>>> On Wed, Apr 9, 2014 at 7:29 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>>>
>>>> I just found this: https://issues.apache.org/jira/browse/HAMA-503 and
>>>> HAMA-639.
>>>>
>>>> Do you still think superstep API is essential for checkpoint/recovery?
>>>> If not, we can drop it. I don't think it's a good idea.
>>>>
>>>> On Wed, Apr 9, 2014 at 7:43 PM, Chia-Hung Lin <clin4j@googlemail.com> wrote:
>>>> > Not sure we are on the same page. And sorry, I am not very
>>>> > familiar with the Superstep implementation.
>>>> >
>>>> > I assume that traditional bsp model means the original bsp interface
>>>> > where there is a bsp function and user can freely call peer.sync(),
>>>> > etc. methods
>>>> >
>>>> > .... bsp(BSPPeer ... peer) {
>>>> >     // whatever computation
>>>> >     peer.sync();
>>>> > }
>>>> >
>>>> > And the superstep style is with Superstep abstract class.
>>>> >
>>>> > If this is the case, SuperstepBSP.java already calls sync, as
>>>> > below, outside each Superstep.compute(). So it looks like even though
>>>> > SuperstepPiEstimator doesn't call the sync() method, a barrier sync will be
>>>> > executed because each Superstep is viewed as a superstep in the original
>>>> > BSP definition.
>>>> >
>>>> >   @Override
>>>> >   public void bsp(BSPPeer<K1, V1, K2, V2, M> peer) throws IOException,
>>>> >       SyncException, InterruptedException {
>>>> >     for (int index = startSuperstep; index < supersteps.length; index++) {
>>>> >       Superstep<K1, V1, K2, V2, M> superstep = supersteps[index];
>>>> >       superstep.compute(peer);
>>>> >       if (superstep.haltComputation(peer)) {
>>>> >         break;
>>>> >       }
>>>> >       peer.sync();
>>>> >       startSuperstep = 0;
>>>> >     }
>>>> >   }
>>>> >
>>>> > Within the Superstep.compute(), if sync is called again, I would think
>>>> > that another barrier sync will be executed.
>>>> >
>>>> > SuperstepBSP.java
>>>> >
>>>> > for(...) {
>>>> >   superstep.compute() -> { // in compute method
>>>> >     ...
>>>> >     peer.sync()
>>>> >   }
>>>> >   ...
>>>> >   peer.sync()
>>>> > }
>>>> >
>>>> > IIRC each call to sync may trigger the checkpoint (no recovery) method,
>>>> > which serializes messages to HDFS.
>>>> >
>>>> > For SerializePrinting, the following code snippet may be moved
>>>> >
>>>> > for (String otherPeer : bspPeer.getAllPeerNames()) {
>>>> >         bspPeer.send(otherPeer, new IntegerMessage(bspPeer.getPeerName(), i));
>>>> > }
>>>> >
>>>> > to Superstep.compute()
>>>> >
>>>> > And the outer for loop is what is programmed in SuperstepBSP.java
>>>> >
>>>> > for (int i = 0; i < NUM_SUPERSTEPS; i++) {
>>>> >     // code that should be moved to Superstep.compute()
>>>> > }
>>>> > bspPeer.sync();
>>>> >
>>>> >
>>>> >
>>>> > On 9 April 2014 16:17, Edward J. Yoon <edwardyoon@apache.org> wrote:
>>>> >> As you can see here[1], the sync() method is never called, and the classes
>>>> >> of all supersteps need to be declared within the Job configuration.
>>>> >> Therefore, I thought it's similar to the Pregel style on the BSP model. It's
>>>> >> quite different from the legacy model in my eyes.
>>>> >>
>>>> >> According to HAMA-505, the superstep API seems to be used for FT job processing
>>>> >> (I haven't read it closely yet). Right? Here, I have some questions. What
>>>> >> happens if I call the sync() method within the compute() method? In this
>>>> >> case, does the framework guarantee checkpoint/recovery? And how can I
>>>> >> implement http://wiki.apache.org/hama/SerializePrinting using the
>>>> >> superstep API?
>>>> >>
>>>> >>> What's the difference between pure BSP and FT BSP? Any concrete example?
>>>> >>
>>>> >> I meant the traditional BSP programming model.
>>>> >>
>>>> >> 1. http://svn.apache.org/repos/asf/hama/trunk/examples/src/main/java/org/apache/hama/examples/SuperstepPiEstimator.java
>>>> >>
>>>> >> On Wed, Apr 9, 2014 at 4:25 PM, Chia-Hung Lin <clin4j@googlemail.com> wrote:
>>>> >>> Sorry, I don't get the point.
>>>> >>>
>>>> >>> What's the difference between pure BSP and FT BSP? Any concrete example?
>>>> >>>
>>>> >>>
>>>> >>> On 9 April 2014 08:29, Edward J. Yoon <edwardyoon@apache.org> wrote:
>>>> >>>> In my eyes, SuperstepPiEstimator[1] looks like a totally new programming
>>>> >>>> model, very similar to Pregel.
>>>> >>>>
>>>> >>>> I personally would like to suggest that we provide both the pure BSP and
>>>> >>>> fault tolerant BSP models, instead of replacing one with the other.
>>>> >>>>
>>>> >>>> 1. http://svn.apache.org/repos/asf/hama/trunk/examples/src/main/java/org/apache/hama/examples/SuperstepPiEstimator.java
>>>> >>>>
>>>> >>>> --
>>>> >>>> Edward J. Yoon (@eddieyoon)
>>>> >>>> Chief Executive Officer
>>>> >>>> DataSayer, Inc.
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Edward J. Yoon (@eddieyoon)
>>>> >> CEO at DataSayer Co., Ltd.
>>>>
>>>>
>>>>
>>>> --
>>>> Edward J. Yoon (@eddieyoon)
>>>> Chief Executive Officer
>>>> DataSayer Co., Ltd.
>>>>
>
>
>
> --
> Edward J. Yoon (@eddieyoon)
> Chief Executive Officer
> DataSayer Co., Ltd.
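
The resume argument made in this thread (a failed task can only restart at a function boundary, which is what the startSuperstep index in the quoted SuperstepBSP.java loop provides) can be sketched without any Hama dependency. This is only an illustration under stated assumptions, not Hama's implementation: the Step interface, the SuperstepResumeSketch class, and the "stepN" log strings are hypothetical names, and the barrier sync and checkpoint write that would sit between steps are reduced to a comment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a superstep: one unit of work between two barriers.
interface Step {
    void compute(List<String> out);
}

final class SuperstepResumeSketch {

    // Runs the steps from startIndex onward and returns a log of what ran.
    // startIndex plays the role of startSuperstep in the quoted loop: 0 for a
    // fresh run, or the index restored from the last checkpoint after a failure.
    static List<String> run(int startIndex) {
        Step[] steps = {
            out -> out.add("step0"),
            out -> out.add("step1"),
            out -> out.add("step2"),
        };
        List<String> log = new ArrayList<>();
        for (int i = startIndex; i < steps.length; i++) {
            steps[i].compute(log);
            // Assumption: a real framework would barrier-sync here and record
            // the next index plus pending messages in a checkpoint.
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run(0)); // fresh run executes every step
        System.out.println(run(2)); // resumed run skips the completed steps
    }
}
```

Because each step is a separate object, the loop can re-enter at any index. A single plain bsp() method body offers no such re-entry point, which is the trade-off debated above.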
