giraph-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Eli Reisman (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (GIRAPH-246) Periodic worker calls to context.progress() will prevent timeout on some Hadoop clusters during barrier waits
Date Wed, 08 Aug 2012 02:23:10 GMT

    [ https://issues.apache.org/jira/browse/GIRAPH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13430811#comment-13430811
] 

Eli Reisman commented on GIRAPH-246:
------------------------------------

Sure if thats what you want. I am not wedded to this solution, I am just nervous about having
to miss testing windows while we figure out why the lock fix isn't working a 2nd time. I will
attempt to repair that solution and try it out, barring that I can at least leave the context
object available and eliminate the Progressable interface, which apparently is not actually
calling Context. I'll have it up tonight/tomorrow.

BTW I realize we are pushing for a release, I am not sure of your window, but I am letting
all these patches that have built up get committed and I can then swoop in with 214 (the GiraphConf)
as the dust settles. I haven't forgotten about it!

                
> Periodic worker calls to context.progress() will prevent timeout on some Hadoop clusters
during barrier waits
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-246
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-246
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>            Priority: Minor
>              Labels: hadoop, patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-246-1.patch, GIRAPH-246-2.patch, GIRAPH-246-3.patch, GIRAPH-246-4.patch,
GIRAPH-246-5.patch, GIRAPH-246-6.patch, GIRAPH-246-7.patch
>
>
> This simple change creates a command-line configurable option in GiraphJob to control
the time between calls to context().progress() that allows workers to avoid timeouts during
long data load-ins in which some works complete their input split reads much faster than others,
or finish a super step faster. I found this allowed jobs that were large-scale but with low
memory overhead to complete even when they would previously time out during runs on a Hadoop
cluster. Timeout is still possible when the worker crashes or runs out of memory or has other
GC or RPC trouble that is legitimate, but prevents unintentional crashes when the worker is
actually still healthy.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message