giraph-dev mailing list archives

From "Eli Reisman (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (GIRAPH-246) Periodic worker calls to context.progress() will prevent timeout on some Hadoop clusters during barrier waits
Date Wed, 08 Aug 2012 06:45:10 GMT

    [ https://issues.apache.org/jira/browse/GIRAPH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13430908#comment-13430908 ]

Eli Reisman commented on GIRAPH-246:
------------------------------------

Avery,

The theory that Progressable is not working is Jakob's, not mine; you'd have to ask him. I'm
just trying to make the patch work. As I've said, the patch I posted here and the one posted as
274-alt-1 are the only ones I have verified to run long jobs without timing out while the run
is still healthy. That's all I know.

I suspect, after attempting 246-8 and 246-9, that some of the longer waitMsecs calls
were not calling progress often enough. This whole problem doesn't make a lot of sense to
me. None of the solutions are very different. One concern I have had since originally
solving this is that you really don't know whether you've fixed it until you've gotten some
long runs to happen: long enough to trigger timeouts, and healthy enough that a timeout isn't
caused by a worker dying and trying to restart rather than by an actual timeout during healthy
work/waits. That's what I'm trying to do with 246-9 right now.

I'm happy with any way forward that is verified to work. The fact that these fixes don't all
have the same effect means there are probably bigger problems to solve to really find
a satisfying answer, but in the meantime I need long jobs to not time out. That's about as
far as my thought process has gotten on this one. Whatever you guys think is best is how we
should move forward, but I am uncomfortable just guessing at this one, as it tends to come
back and bite you when you need that big job to finish ;)

The first of you guys to come up with a verified solution that you find palatable is entitled
to some beer on my behalf. In fact, consider yourselves both entitled already ;)
 
                
> Periodic worker calls to context.progress() will prevent timeout on some Hadoop clusters during barrier waits
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-246
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-246
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>            Priority: Minor
>              Labels: hadoop, patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-246-1.patch, GIRAPH-246-2.patch, GIRAPH-246-3.patch, GIRAPH-246-4.patch,
>                      GIRAPH-246-5.patch, GIRAPH-246-6.patch, GIRAPH-246-7.patch, GIRAPH-246-8.patch,
>                      GIRAPH-246-9.patch
>
>
> This simple change creates a command-line configurable option in GiraphJob to control
> the time between calls to context().progress(). It allows workers to avoid timeouts during
> long data load-ins in which some workers complete their input split reads much faster than
> others, or finish a superstep faster. I found this allowed jobs that were large-scale but with
> low memory overhead to complete even when they would previously time out during runs on a
> Hadoop cluster. Timeout is still possible when the worker crashes, runs out of memory, or has
> other legitimate GC or RPC trouble, but the change prevents unintentional crashes when the
> worker is actually still healthy.
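
The idea in the description above can be illustrated with a minimal sketch. This is not the code
from any of the GIRAPH-246 patches. The class name ProgressingWait, the method waitWithProgress,
and its parameters are hypothetical, and a plain Runnable stands in for the call to
context.progress() so that the example runs without Hadoop on the classpath. The sketch splits
one long barrier wait into short slices and reports progress after each slice that ends without
the barrier opening.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class ProgressingWait {
    /**
     * Waits up to totalMsecs for the barrier to open, invoking the progress
     * callback after every slice of at most progressIntervalMsecs.
     * In a Hadoop task the callback would be something like context::progress
     * (assumption: not taken from the actual patch).
     *
     * @return true if the barrier opened, false if the full wait elapsed
     */
    public static boolean waitWithProgress(CountDownLatch barrier,
                                           long totalMsecs,
                                           long progressIntervalMsecs,
                                           Runnable progress)
            throws InterruptedException {
        if (progressIntervalMsecs <= 0) {
            throw new IllegalArgumentException("progress interval must be positive");
        }
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(totalMsecs);
        long intervalNanos = TimeUnit.MILLISECONDS.toNanos(progressIntervalMsecs);
        while (true) {
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                // Out of time: report whether the barrier happens to be open now.
                return barrier.getCount() == 0;
            }
            // Never sleep longer than one progress interval at a time.
            long sliceNanos = Math.min(remainingNanos, intervalNanos);
            if (barrier.await(sliceNanos, TimeUnit.NANOSECONDS)) {
                return true;
            }
            // Still waiting, but healthy: tell the framework we are alive.
            progress.run();
        }
    }
}
```

The progress interval would need to stay well below the cluster's task timeout for this to help.
A worker that has crashed or is stuck in GC never reaches the callback, so it can still time out
as before.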

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
