giraph-dev mailing list archives

From "Eli Reisman (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (GIRAPH-246) Periodic worker calls to context.progress() will prevent timeout on some Hadoop clusters during barrier waits
Date Mon, 13 Aug 2012 18:33:38 GMT

     [ https://issues.apache.org/jira/browse/GIRAPH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Reisman updated GIRAPH-246:
-------------------------------

    Attachment: GIRAPH-246-7_rebase1.patch

This patch is a rebase of the "revert 267" code. It's not meant for inclusion, as the general
consensus is that we want to keep the PredicateLock code in the codebase when we get a final fix
for this. I am uploading it for convenience, as people here needed a rebase. I still have
no idea why this one seems to work for us.

In general, as our ops guy noticed within about 3 minutes of watching a run (that later timed out
:( ), for one reason or another, regardless of approach, Giraph is clearly not sending enough
progress messages to Hadoop despite a fair number of calls.

This observation was made on trunk with the currently implemented PredicateLock calls,
but as Avery mentioned, this should not matter. I mentioned a few theories on the GIRAPH-274
thread regarding this. Chief among them: my understanding is that when you call context.progress(),
it only logs the call, and actually calls out to Hadoop only after some fixed number of them
have been logged. Given this, perhaps we should have PredicateLock's waitMsecs() and waitForever()
both break out of their wait loops every so often and re-call progress, so that in any given
10-minute window (the Hadoop timeout default) we get enough calls in to force a "real" call on
the wire to Hadoop.
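
A minimal sketch of that idea, for discussion only. This is not the actual PredicateLock
code: the class name, the sliceMsecs parameter, and the Runnable standing in for
context.progress() are all hypothetical. Each wait is chopped into short slices, and
progress is reported after every slice.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch; NOT Giraph's real PredicateLock. */
class ProgressingPredicateLock {
    private final Lock lock = new ReentrantLock();
    private final Condition cond = lock.newCondition();
    /** Stand-in for context.progress() (assumption: any Runnable). */
    private final Runnable progressable;
    /** Longest time to block before reporting progress again (assumed knob). */
    private final long sliceMsecs;
    private boolean eventOccurred = false;

    ProgressingPredicateLock(Runnable progressable, long sliceMsecs) {
        this.progressable = progressable;
        this.sliceMsecs = sliceMsecs;
    }

    /** Mark the event as having occurred and wake all waiters. */
    void signal() {
        lock.lock();
        try {
            eventOccurred = true;
            cond.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /** Wait up to msecs for the event; returns true if it occurred. */
    boolean waitMsecs(long msecs) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(msecs);
        long sliceNanos = TimeUnit.MILLISECONDS.toNanos(sliceMsecs);
        lock.lock();
        try {
            while (!eventOccurred) {
                long remaining = deadline - System.nanoTime();
                if (remaining <= 0) {
                    return false; // timed out without seeing the event
                }
                // Block for at most one slice, then report progress.
                cond.awaitNanos(Math.min(remaining, sliceNanos));
                progressable.run(); // note: runs while holding the lock
            }
            return true;
        } finally {
            lock.unlock();
        }
    }

    /** Wait until the event occurs, reporting progress after every slice. */
    void waitForever() throws InterruptedException {
        lock.lock();
        try {
            while (!eventOccurred) {
                cond.await(sliceMsecs, TimeUnit.MILLISECONDS);
                progressable.run();
            }
        } finally {
            lock.unlock();
        }
    }
}
```

In a worker this might be constructed with something like context::progress and a slice
well under the task timeout. Whether that produces enough "real" calls on the wire is
exactly the open question above, so treat the slice length as something to tune.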

I have had a lot of trouble finding a clear moment to run any of these candidates as our clusters
have been piled up this last week, but I look forward to trying more of these and hopefully
finding one that retains the PredicateLock code and also does not time out.

If I make any progress, I'll throw another patch on the pile here. Ideas welcome on this one.
Calling all Hadoop MR specialists...

                
> Periodic worker calls to context.progress() will prevent timeout on some Hadoop clusters during barrier waits
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-246
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-246
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>            Priority: Minor
>              Labels: hadoop, patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-246-10.patch, GIRAPH-246-11.patch, GIRAPH-246-1.patch, GIRAPH-246-2.patch,
> GIRAPH-246-3.patch, GIRAPH-246-4.patch, GIRAPH-246-5.patch, GIRAPH-246-6.patch, GIRAPH-246-7.patch,
> GIRAPH-246-7_rebase1.patch, GIRAPH-246-8.patch, GIRAPH-246-9.patch
>
>
> This simple change creates a command-line configurable option in GiraphJob to control
> the time between calls to context().progress(). It allows workers to avoid timeouts during
> long data load-ins in which some workers complete their input split reads much faster than
> others, or finish a superstep faster. I found this allowed jobs that were large-scale but with
> low memory overhead to complete even when they would previously time out during runs on a
> Hadoop cluster. Timeout is still possible when the worker crashes, runs out of memory, or has
> other legitimate GC or RPC trouble, but the change prevents unintentional crashes when the
> worker is actually still healthy.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
