Date: Wed, 21 Nov 2012 13:42:51 -0500
Subject: Re: svn commit: r1411991 - /hama/trunk/core/src/main/java/org/apache/hama/bsp/GroomServer.java
From: Suraj Menon
To: dev@hama.apache.org
References: <20121121060427.E079623888E3@eris.apache.org>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=bcaec54fb9d2c0193004cf05b79f

--bcaec54fb9d2c0193004cf05b79f
Content-Type: text/plain; charset=ISO-8859-1

Hi,

It shouldn't matter; we will just be a little later in detecting a dead
task. With this change you have kept the ping rate the same but given the
task more leeway to miss its pings. Are you seeing a situation where the
ping does not fire as scheduled by the timer? What is the difference
between multiplying the monitor period by 6 and multiplying the ping
interval by 6?

On Wed, Nov 21, 2012 at 6:30 AM, Edward J. Yoon wrote:

> Hi,
>
> I saw most failures are caused by too-sensitive monitoring. Please
> check whether this change can cause any problems.
>
> + && (((tip.lastPingedTimestamp == 0 && ((currentTime - tip.startTime) > 10 * monitorPeriod)) || ((tip.lastPingedTimestamp > 0) && (currentTime - tip.lastPingedTimestamp) > 6 * monitorPeriod)))) {
>
>
> On Wed, Nov 21, 2012 at 3:04 PM, wrote:
> > Author: edwardyoon
> > Date: Wed Nov 21 06:04:26 2012
> > New Revision: 1411991
> >
> > URL: http://svn.apache.org/viewvc?rev=1411991&view=rev
> > Log:
> > Monitoring of tasks is too sensitive.
> >
> > Modified:
> >     hama/trunk/core/src/main/java/org/apache/hama/bsp/GroomServer.java
> >
> > Modified: hama/trunk/core/src/main/java/org/apache/hama/bsp/GroomServer.java
> > URL: http://svn.apache.org/viewvc/hama/trunk/core/src/main/java/org/apache/hama/bsp/GroomServer.java?rev=1411991&r1=1411990&r2=1411991&view=diff
> > ==============================================================================
> > --- hama/trunk/core/src/main/java/org/apache/hama/bsp/GroomServer.java (original)
> > +++ hama/trunk/core/src/main/java/org/apache/hama/bsp/GroomServer.java Wed Nov 21 06:04:26 2012
> > @@ -207,10 +207,9 @@ public class GroomServer implements Runn
> >            try {
> >              startRecoveryTask(recoverAction);
> >            } catch (IOException e) {
> > -            throw new DirectiveException(
> > -                new StringBuffer().append("Error starting the recovery task")
> > -                    .append(t.getTaskID()).toString(),
> > -                e);
> > +            throw new DirectiveException(new StringBuffer()
> > +                .append("Error starting the recovery task")
> > +                .append(t.getTaskID()).toString(), e);
> >            }
> >          }
> >        }
> > @@ -617,17 +616,17 @@ public class GroomServer implements Runn
> >      }
> >
> >      Iterator taskIterator = tasks.keySet().iterator();
> > -    while(taskIterator.hasNext()){
> > +    while (taskIterator.hasNext()) {
> >        TaskAttemptID taskAttId = taskIterator.next();
> > -      if(taskAttId.getTaskID().equals(t.getTaskID().getTaskID())){
> > -        if(LOG.isDebugEnabled()){
> > +      if (taskAttId.getTaskID().equals(t.getTaskID().getTaskID())) {
> > +        if (LOG.isDebugEnabled()) {
> >            LOG.debug("Removing tasks with id = " + t.getTaskID().getTaskID());
> >          }
> >          taskIterator.remove();
> >          runningTasks.remove(taskAttId);
> >        }
> >      }
> > -
> > +
> >      tasks.put(t.getTaskID(), tip);
> >      runningTasks.put(t.getTaskID(), tip);
> >    }
> > @@ -637,14 +636,14 @@ public class GroomServer implements Runn
> >        String msg = ("Error initializing " + tip.getTask().getTaskID() + ":\n" + StringUtils
> >            .stringifyException(e));
> >        LOG.warn(msg);
> > -
> > +
> >        try {
> >          tip.killAndCleanup(true);
> >        } catch (IOException ie2) {
> >          LOG.info("Error cleaning up " + tip.getTask().getTaskID() + ":\n"
> >              + StringUtils.stringifyException(ie2));
> >        }
> > -      throw new IOException("Errro localizing the job.",e);
> > +      throw new IOException("Errro localizing the job.", e);
> >      }
> >    }
> >
> > @@ -807,20 +806,17 @@ public class GroomServer implements Runn
> >              + " monitorPeriod = "
> >              + monitorPeriod
> >              + " check = "
> > -            + (tip.taskStatus.getRunState().equals(TaskStatus.State.RUNNING) &&
> > -                (((tip.lastPingedTimestamp == 0 &&
> > -                    ((currentTime - tip.startTime) > 10 * monitorPeriod)) ||
> > -                    ((tip.lastPingedTimestamp > 0) &&
> > -                        (currentTime - tip.lastPingedTimestamp) > monitorPeriod)))));
> > +            + (tip.taskStatus.getRunState().equals(TaskStatus.State.RUNNING) && (((tip.lastPingedTimestamp == 0 && ((currentTime - tip.startTime) > 10 * monitorPeriod)) || ((tip.lastPingedTimestamp > 0) && (currentTime - tip.lastPingedTimestamp) > 6 * monitorPeriod)))));
> >
> >        // Task is out of contact if it has not pinged since more than
> >        // monitorPeriod. A task is given a leeway of 10 times monitorPeriod
> >        // to get started.
> > +
> > +      // TODO Please refactor this conditions
> > +      // NOTE: (currentTime - tip.lastPingedTimestamp) > 6 * monitorPeriod
> > +
> >        if (tip.taskStatus.getRunState().equals(TaskStatus.State.RUNNING)
> > -          && (((tip.lastPingedTimestamp == 0
> > -          && ((currentTime - tip.startTime) > 10 * monitorPeriod))
> > -          || ((tip.lastPingedTimestamp > 0)
> > -          && (currentTime - tip.lastPingedTimestamp) > monitorPeriod)))) {
> > +          && (((tip.lastPingedTimestamp == 0 && ((currentTime - tip.startTime) > 10 * monitorPeriod)) || ((tip.lastPingedTimestamp > 0) && (currentTime - tip.lastPingedTimestamp) > 6 * monitorPeriod)))) {
> >
> >          LOG.info("adding purge task: " + tip.getTask().getTaskID());
> >
> > @@ -1048,7 +1044,7 @@ public class GroomServer implements Runn
> >
> >        // runner could be null if task-cleanup attempt is not localized yet
> >        if (runner != null) {
> > -        if(LOG.isDebugEnabled()){
> > +        if (LOG.isDebugEnabled()) {
> >            LOG.debug("Killing process for " + this.task.getTaskID());
> >          }
> >          runner.killBsp();
> > @@ -1058,7 +1054,7 @@ public class GroomServer implements Runn
> >
> >      public synchronized void killRunner() throws IOException {
> >        if (runner != null) {
> > -        if(LOG.isDebugEnabled()){
> > +        if (LOG.isDebugEnabled()) {
> >            LOG.debug("Killing process for " + this.task.getTaskID());
> >          }
> >          runner.killBsp();
> > @@ -1251,12 +1247,11 @@ public class GroomServer implements Runn
> >          defaultConf.setInt("bsp.checkpoint.port", Integer.parseInt(args[4]));
> >        }
> >        defaultConf.setInt(Constants.PEER_PORT, peerPort);
> > -
> > +
> >        long superstep = Long.parseLong(args[4]);
> >        TaskStatus.State state = TaskStatus.State.valueOf(args[5]);
> >        LOG.debug("Starting peer for sstep " + superstep + " state = " + state);
> >
> > -
> >        try {
> >          // use job-specified working directory
> >          FileSystem.get(job.getConfiguration()).setWorkingDirectory(
> >
> >
> >
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon

--bcaec54fb9d2c0193004cf05b79f--
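
For reference, the out-of-contact condition from the commit (the one its TODO asks to refactor) could be pulled into a small helper along these lines. This is only a sketch: the class name, method name, and constant names are hypothetical and do not exist in Hama, and the 1000 ms monitorPeriod in main is an assumed value for illustration. The two multipliers (10 and 6) and the comparison logic are copied from the committed condition.

```java
// Illustrative sketch only: TaskPingMonitor and its members are hypothetical
// names, not part of Hama. The logic mirrors the condition in r1411991.
final class TaskPingMonitor {

  // A task that has never pinged gets 10 monitor periods to start up.
  static final int STARTUP_LEEWAY = 10;
  // A task that has pinged before may go silent for 6 monitor periods.
  static final int MISSED_PING_LEEWAY = 6;

  /** Returns true if a task should be considered out of contact. */
  static boolean isOutOfContact(boolean running, long startTime,
      long lastPingedTimestamp, long currentTime, long monitorPeriod) {
    if (!running) {
      return false; // only RUNNING tasks are monitored
    }
    if (lastPingedTimestamp == 0) {
      // Never pinged: measure from the task start time.
      return (currentTime - startTime) > STARTUP_LEEWAY * monitorPeriod;
    }
    if (lastPingedTimestamp > 0) {
      // Pinged before: measure from the last ping.
      return (currentTime - lastPingedTimestamp)
          > MISSED_PING_LEEWAY * monitorPeriod;
    }
    return false; // negative timestamps match neither branch, as in the original
  }

  public static void main(String[] args) {
    long monitorPeriod = 1000L; // assumed value, in milliseconds
    // Last ping 6000 ms ago: exactly at the threshold, still in contact.
    System.out.println(isOutOfContact(true, 0L, 4000L, 10_000L, monitorPeriod));
    // Last ping 6001 ms ago: past the threshold, would be purged.
    System.out.println(isOutOfContact(true, 0L, 4000L, 10_001L, monitorPeriod));
  }
}
```

Under these assumptions the groom still checks at the same rate, but a task is purged only after more than six monitor periods of silence, which is the extra leeway discussed in the reply above.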