Return-Path:
X-Original-To: apmail-giraph-dev-archive@www.apache.org
Delivered-To: apmail-giraph-dev-archive@www.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
    by minotaur.apache.org (Postfix) with SMTP id 31C2ED214
    for ; Fri, 17 Aug 2012 00:35:39 +0000 (UTC)
Received: (qmail 35877 invoked by uid 500); 17 Aug 2012 00:35:39 -0000
Delivered-To: apmail-giraph-dev-archive@giraph.apache.org
Received: (qmail 35457 invoked by uid 500); 17 Aug 2012 00:35:38 -0000
Mailing-List: contact dev-help@giraph.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: dev@giraph.apache.org
Delivered-To: mailing list dev@giraph.apache.org
Received: (qmail 35420 invoked by uid 500); 17 Aug 2012 00:35:38 -0000
Delivered-To: apmail-incubator-giraph-dev@incubator.apache.org
Received: (qmail 35412 invoked by uid 99); 17 Aug 2012 00:35:38 -0000
Received: from arcas.apache.org (HELO arcas) (140.211.11.28)
    by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 17 Aug 2012 00:35:38 +0000
Received: from arcas.apache.org (localhost [127.0.0.1])
    by arcas (Postfix) with ESMTP id 36E342C5BE5
    for ; Fri, 17 Aug 2012 00:35:38 +0000 (UTC)
Date: Fri, 17 Aug 2012 11:35:38 +1100 (NCT)
From: "Eli Reisman (JIRA)"
To: giraph-dev@incubator.apache.org
Message-ID: <1801901178.22028.1345163738225.JavaMail.jiratomcat@arcas>
In-Reply-To: <4192876.36357.1342025854696.JavaMail.jiratomcat@issues-vm>
Subject: [jira] [Updated] (GIRAPH-246) Periodic worker calls to context.progress() will prevent timeout on some Hadoop clusters during barrier waits
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

     [ https://issues.apache.org/jira/browse/GIRAPH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Reisman updated GIRAPH-246:
-------------------------------

    Attachment: GIRAPH-246-NEW-FIX-2.patch

This includes one more call to progress() and
is otherwise just like the first "NEW-FIX" patch.

For some reason, neither trunk nor any of these PredicateLock fixes lets the stderr and stdout logs reach the HTML Mapper Detail pages; the old rebase does. The only times my old rebase blew up and "timed out" today were when Netty blew up under stress tests, as the Mapper Details revealed. This patch had occasional "timeouts" in the same circumstances, but there was no log to confirm why (which is why I doubted the earlier "new fix" patch).

If we insist on using a PredicateLock solution, we can commit this and hope it was just Netty blowing up during stress tests. This replicates the good behavior of the old rebase patch as closely as any I've tested yet.

Alternatively, I can upload the latest rebase of the original 246, and we can commit that and wait for a better fix. But if you guys are having trouble replicating this problem, then chances are that shooting in the dark for a fix to replace the rebase will not work, and I know you don't like seeing the progress calls in BspServiceWorker.

Either way is fine at this point. I will be running a lot of code over the next few days/nights as I get clear windows to do so, but after running trunk a bunch today while doing A/B tests on these, I am 100% sure that the current fix does not work for us here. I'd love to pick one (the rebase or this) and commit; it's at least a step forward from what we have, which is just not functioning for us at all.
Users here are getting sick of manually adding patches to run Giraph on our job queue ;)

> Periodic worker calls to context.progress() will prevent timeout on some Hadoop clusters during barrier waits
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-246
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-246
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>            Priority: Minor
>              Labels: hadoop, patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-246-10.patch, GIRAPH-246-11.patch, GIRAPH-246-1.patch, GIRAPH-246-2.patch, GIRAPH-246-3.patch, GIRAPH-246-4.patch, GIRAPH-246-5.patch, GIRAPH-246-6.patch, GIRAPH-246-7.patch, GIRAPH-246-7_rebase1.patch, GIRAPH-246-8.patch, GIRAPH-246-9.patch, GIRAPH-246-NEW-FIX-2.patch, GIRAPH-246-NEW-FIX.patch
>
>
> This simple change creates a command-line configurable option in GiraphJob to control the time between calls to context().progress(). It allows workers to avoid timeouts during long data load-ins in which some workers complete their input split reads much faster than others, or finish a superstep faster. I found this allowed jobs that were large-scale but with low memory overhead to complete even when they would previously time out during runs on a Hadoop cluster. Timeout is still possible when the worker crashes, runs out of memory, or has other legitimate GC or RPC trouble, but the change prevents unintentional crashes when the worker is actually still healthy.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
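
For readers skimming the archive, the idea in the issue description (wake up periodically during a barrier wait and report progress so the framework does not time the task out) can be sketched roughly as below. This is only an illustration under stated assumptions, not the code in any of the attached patches: the class and method names are hypothetical, a CountDownLatch stands in for the Giraph barrier, and a plain Runnable stands in for the Hadoop context.progress() call.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration; none of these names come from Giraph or Hadoop.
public class ProgressingBarrierWait {

    /**
     * Blocks until the barrier is released. Every time intervalMs elapses
     * without the barrier being released, reportProgress is invoked once
     * (the stand-in for context.progress()).
     *
     * @return the number of progress reports made while waiting
     */
    public static int awaitWithProgress(CountDownLatch barrier,
                                        long intervalMs,
                                        Runnable reportProgress)
            throws InterruptedException {
        int calls = 0;
        // await(timeout) returns false when the timeout expires first.
        while (!barrier.await(intervalMs, TimeUnit.MILLISECONDS)) {
            reportProgress.run();
            calls++;
        }
        return calls;
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch barrier = new CountDownLatch(1);
        AtomicInteger heartbeats = new AtomicInteger();

        // Simulates a slower worker that reaches the barrier late.
        Thread slowWorker = new Thread(() -> {
            try {
                Thread.sleep(350);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            barrier.countDown();
        });
        slowWorker.start();

        // Assumed interval of 100 ms; in the issue this is the configurable option.
        int calls = awaitWithProgress(barrier, 100, heartbeats::incrementAndGet);
        slowWorker.join();
        System.out.println("progress calls while waiting: " + calls);
    }
}
```

The interval would need to stay well below the cluster's task timeout for this to help, and a worker that has genuinely crashed or hung stops making these calls, so it can still time out as the description notes.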