From: Suraj Menon
To: "dev@hama.apache.org"
Date: Wed, 9 Apr 2014 11:56:29 -0400
Subject: Re: [DISCUSS] Fault tolerant BSP job

I don't like my patch in HAMA-639 myself, even though I believe it
satisfies all the mentioned requirements. The usage of the superstep
chaining API implementation in the patch is too complicated.

A superstep here is like a transformation function you define on an RDD
in Spark. If you look at the FT design of Spark, on failure it reruns
the operations on the RDD to get back to the current state. This is
similar to what we have in mind using checkpointing. The challenge is in
getting the same messages replayed to a newly spawned task on the
checkpointed data. If you don't use Superstep (or any other abstraction
representing a function), you cannot resume processing from the line of
code where the failure occurred. (Java does not support jumping to a
line number.)

-Suraj

On Wed, Apr 9, 2014 at 7:29 AM, Edward J. Yoon wrote:

> I just found this: https://issues.apache.org/jira/browse/HAMA-503 and
> HAMA-639.
>
> Do you still think the superstep API is essential for
> checkpoint/recovery? If not, we can drop it. I don't think it's a
> good idea.
>
> On Wed, Apr 9, 2014 at 7:43 PM, Chia-Hung Lin wrote:
> > I'm not sure we are on the same page, and sorry, I am not very
> > familiar with the Superstep implementation.
> >
> > I assume that the traditional bsp model means the original bsp
> > interface, where there is a bsp function and the user can freely
> > call peer.sync() and other methods:
> >
> > .... bsp(BSPPeer ... peer) {
> >   // whatever computation
> >   peer.sync();
> > }
> >
> > And the superstep style is the one with the Superstep abstract class.
> >
> > If this is the case, SuperstepBSP.java already calls sync(), as
> > below, outside each Superstep.compute(). So it looks like even
> > though SuperstepPiEstimator doesn't call the sync() method, a barrier
> > sync will be executed, because each Superstep is viewed as a
> > superstep in the original BSP definition.
> >
> >   @Override
> >   public void bsp(BSPPeer peer) throws IOException,
> >       SyncException, InterruptedException {
> >     for (int index = startSuperstep; index < supersteps.length; index++) {
> >       Superstep superstep = supersteps[index];
> >       superstep.compute(peer);
> >       if (superstep.haltComputation(peer)) {
> >         break;
> >       }
> >       peer.sync();
> >       startSuperstep = 0;
> >     }
> >   }
> >
> > Within Superstep.compute(), if sync is called again, I would think
> > that another barrier sync will be executed.
> >
> > SuperstepBSP.java
> >
> > for(...) {
> >   superstep.compute() -> { // in compute method
> >     ...
> >     peer.sync()
> >   }
> >   ...
> >   peer.sync()
> > }
> >
> > IIRC each call to sync() may trigger the checkpoint (no recovery)
> > method, which serializes messages to HDFS.
> >
> > For SerializePrinting, the following code snippet may move
> >
> >   for (String otherPeer : bspPeer.getAllPeerNames()) {
> >     bspPeer.send(otherPeer, new IntegerMessage(bspPeer.getPeerName(), i));
> >   }
> >
> > to Superstep.compute().
> >
> > And the outer for loop is what is programmed in SuperstepBSP.java:
> >
> >   for (int i = 0; i < NUM_SUPERSTEPS; i++) {
> >     // code that should be moved to Superstep.compute()
> >   }
> >   bspPeer.sync();
> >
> >
> > On 9 April 2014 16:17, Edward J. Yoon wrote:
> >> As you can see here[1], the sync() method is never called, and the
> >> classes of all supersteps need to be declared within the Job
> >> configuration. Therefore, I thought it's similar to the Pregel
> >> style on the BSP model. It's quite different from the legacy model
> >> in my eyes.
> >>
> >> According to HAMA-505, the superstep API seems to be used for FT job
> >> processing (I didn't read closely yet). Right? Here, I have some
> >> questions. What happens if I call the sync() method within the
> >> compute() method? In this case, does the framework guarantee
> >> checkpoint/recovery? And how can I implement
> >> http://wiki.apache.org/hama/SerializePrinting using the superstep
> >> API?
> >>
> >>> What's difference between pure BSP and FT BSP? Any concrete example?
> >>
> >> I meant the traditional BSP programming model.
> >>
> >> 1.
> >> http://svn.apache.org/repos/asf/hama/trunk/examples/src/main/java/org/apache/hama/examples/SuperstepPiEstimator.java
> >>
> >> On Wed, Apr 9, 2014 at 4:25 PM, Chia-Hung Lin wrote:
> >>> Sorry, I don't catch the point.
> >>>
> >>> What's the difference between pure BSP and FT BSP? Any concrete
> >>> example?
> >>>
> >>>
> >>> On 9 April 2014 08:29, Edward J. Yoon wrote:
> >>>> In my eyes, SuperstepPiEstimator[1] looks like a totally new
> >>>> programming model, very similar to Pregel.
> >>>>
> >>>> I personally would like to suggest that we provide both the pure
> >>>> BSP and the fault tolerant BSP model, instead of replacing one
> >>>> with the other.
> >>>>
> >>>> 1.
> >>>> http://svn.apache.org/repos/asf/hama/trunk/examples/src/main/java/org/apache/hama/examples/SuperstepPiEstimator.java
> >>>>
> >>>> --
> >>>> Edward J. Yoon (@eddieyoon)
> >>>> Chief Executive Officer
> >>>> DataSayer, Inc.
> >>
> >>
> >>
> >> --
> >> Edward J. Yoon (@eddieyoon)
> >> CEO at DataSayer Co., Ltd.
>
>
> --
> Edward J. Yoon (@eddieyoon)
> Chief Executive Officer
> DataSayer Co., Ltd.
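
Illustration of the point made at the top of this message, that a function-like superstep abstraction lets a restarted task resume from a checkpointed step index rather than from a line of code. This is only a rough sketch under stated assumptions: Step, Checkpoint, exampleSteps and run are hypothetical names, not Hama classes, the state is a single long, and message replay (the hard part mentioned above) is left out entirely.

```java
public class SuperstepResumeSketch {

    /** Hypothetical stand-in for a superstep: the work done between two barriers. */
    public interface Step {
        long compute(long state);
    }

    /** Hypothetical checkpoint: index of the next step to run plus the state to feed it. */
    public static final class Checkpoint {
        public final int nextIndex;
        public final long state;

        public Checkpoint(int nextIndex, long state) {
            this.nextIndex = nextIndex;
            this.state = state;
        }
    }

    /** Four arbitrary steps, used only to make the sketch runnable. */
    public static Step[] exampleSteps() {
        return new Step[] {
            s -> s + 1,
            s -> s * 3,
            s -> s - 4,
            s -> s * s
        };
    }

    /**
     * Runs the steps starting at the checkpointed index. If failAt equals the
     * index of a step, a task failure is simulated before that step runs and
     * the last checkpoint is returned. Pass -1 for a run without failure.
     */
    public static Checkpoint run(Step[] steps, Checkpoint from, int failAt) {
        Checkpoint last = from;
        for (int i = from.nextIndex; i < steps.length; i++) {
            if (i == failAt) {
                return last; // simulated crash: only the checkpoint survives
            }
            long next = steps[i].compute(last.state);
            // In a real BSP job the barrier sync would sit here; the
            // checkpoint is taken once the step has completed.
            last = new Checkpoint(i + 1, next);
        }
        return last;
    }

    public static void main(String[] args) {
        Step[] steps = exampleSteps();
        Checkpoint start = new Checkpoint(0, 2L);

        Checkpoint uninterrupted = run(steps, start, -1);
        Checkpoint saved = run(steps, start, 2);      // fails before step 2
        Checkpoint recovered = run(steps, saved, -1); // new task resumes at step 2

        // prints: 25 2 25
        System.out.println(uninterrupted.state + " " + saved.nextIndex + " " + recovered.state);
    }
}
```

The recovered run never re-enters the middle of a method; it just picks the array index stored in the checkpoint. With a single free-form bsp() method there is no such index to resume from.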