Return-Path: X-Original-To: apmail-incubator-hama-dev-archive@minotaur.apache.org Delivered-To: apmail-incubator-hama-dev-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 7E8CA922C for ; Mon, 19 Sep 2011 05:55:36 +0000 (UTC) Received: (qmail 5872 invoked by uid 500); 19 Sep 2011 05:55:36 -0000 Delivered-To: apmail-incubator-hama-dev-archive@incubator.apache.org Received: (qmail 4884 invoked by uid 500); 19 Sep 2011 05:55:35 -0000 Mailing-List: contact hama-dev-help@incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: hama-dev@incubator.apache.org Delivered-To: mailing list hama-dev@incubator.apache.org Received: (qmail 4859 invoked by uid 99); 19 Sep 2011 05:55:34 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 19 Sep 2011 05:55:34 +0000 X-ASF-Spam-Status: No, hits=4.0 required=5.0 tests=FREEMAIL_FROM,FREEMAIL_REPLY,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of thomas.jungblut@googlemail.com designates 209.85.212.41 as permitted sender) Received: from [209.85.212.41] (HELO mail-vw0-f41.google.com) (209.85.212.41) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 19 Sep 2011 05:55:28 +0000 Received: by vwm42 with SMTP id 42so10326128vwm.0 for ; Sun, 18 Sep 2011 22:55:07 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=googlemail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=h/eXk40ZGRXJS1bMY96/DGfs1TpwjRM9p27AEoVI3sE=; b=nBK2uBQ+p+cR4+1Yyomws8TM/pumIFvziJzyiLV8ckTuy12BzXwAW0k2y9Ad38bhs/ ZnBaRWyVTU2ZuGXycHrdgrHJANzl+yPPsFk9anS/C9oiH6OdNkn0docyysHaUgpzTP1i mQ1zG939ITS84RRV9/O5DBoqxRIeE1A/kiUPY= MIME-Version: 1.0 Received: by 10.52.29.136 with SMTP id k8mr1587038vdh.283.1316411707433; Sun, 18 Sep 2011 22:55:07 -0700 (PDT) Received: by 10.52.187.136 with HTTP; Sun, 18 Sep 2011 22:55:07 -0700 (PDT) In-Reply-To: <1316403269.9066.chl501@nuk.edu.tw> References: <1316403269.9066.chl501@nuk.edu.tw> Date: Mon, 19 Sep 2011 07:55:07 +0200 Message-ID: Subject: Re: [Discussion] Refactor bsp() for recovery procedure From: Thomas Jungblut To: hama-dev@incubator.apache.org, chl501@nuk.edu.tw Content-Type: multipart/alternative; boundary=20cf307d030033226004ad44fc04 X-Virus-Checked: Checked by ClamAV on apache.org --20cf307d030033226004ad44fc04 Content-Type: text/plain; charset=ISO-8859-1 Hi ChiaHung, I would not split this into several classes like SuperStep1 or SuperStep2 and the chaining sounds a bit strange to me. But, what I think your idea is cool, the BSPSuperstep class is starting after a sync phase and is ending with it (easier for the user, because the workflow is simpler). Here is my proposal: BSPSuperstep step; > int rollbackSuperStep = -1; > if((rollbackSuperStep = conf.getInt(bsp.rollback.superstep) ) > -1)[ > step = BSPSuperstep.getSuperStep(rollbackSuperStep); > } > while(!notHalted){ > sync(); > step = new BSPSuperstep(CURRENT_NUMBER_OF_SUPERSTEP); > step.compute(List list); > save(step); > notHalted = checkHalted(); > } > I know that diverges alot from your idea. Maybe you have to put the sync into the tail of the loop. But what do you think on that? 2011/9/19 ChiaHung Lin > > Currently we have bsp() where users can code for performing thier tasks. For instance, > > ... bsp() ...{ > ... // some computation > sync(); > ... // some other computation > sync(); > ... > } > > However, this is difficult for recovery because 1st, it requires checkpointed messages to be recovered so that the computation can be resumed from where it fails; 2nd, the recovery procedure needs to know from which super step to restart. With the current bsp(), it seems a common choice is preprocessing; but this may not be good because when internally something goes wrong it, it is not easy to find out the problem. > > I come up with an alternative method but this would have change to the way of our current procedure. So I think it would be good to discuss it first. It is proposed as below: > > 1. we divide bsp() into smaller computation unit called e.g. step() or superstep(), within which user still write their own logic. > > 2. in main, user composes the order of supersteps. > > ... class Superstep1 extends BSPSuperstep { > ... superstep() ... {...} > } > ... class Superstep2 extends BSPSuperstep { > ... superstep() ... {...} > } > > BSPJob bsp = new BSP(...); > bsp.compose(Superstep1.class).compose(Superstep2.class)...; > > Therefore, when recovery, in BSPTask run() we can have > > List steps = BSPJob.supersteps(); > > for(BSPSuperstep step: steps) { > if(checkpointed) { > // restore checkpointed messages e.g. adding checkpointed msg (in hdfs) back to queues > } > step.superstep(...); > step.sync(); > } > > The advantage is easier for recovery procedure. > The disadvantage may be the client programme need to explicitly tell the order of superstep. > > Any thought? > > -- > ChiaHung Lin > Department of Information Management > National University of Kaohsiung > Taiwan -- Thomas Jungblut Berlin mobile: 0170-3081070 business: thomas.jungblut@testberichte.de private: thomas.jungblut@gmail.com --20cf307d030033226004ad44fc04--