giraph-dev mailing list archives

From "Eli Reisman (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (GIRAPH-246) Periodic worker calls to context.progress() will prevent timeout on some Hadoop clusters during barrier waits
Date Fri, 17 Aug 2012 17:00:38 GMT

    [ https://issues.apache.org/jira/browse/GIRAPH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436866#comment-13436866 ]

Eli Reisman commented on GIRAPH-246:
------------------------------------

More testing this morning. The 246-NEW-FIX-2.patch calls progress() every 10 seconds regardless
of the variable-length timed waits in waitMsecs() that this patch sets up, or in waitForever()
as Jaeho set it up to do in GIRAPH-267 and as trunk already does. I think this is ready to go.
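
In case it helps to see the shape of it, the idea is roughly the sketch below. This is a
minimal sketch and not the literal patch: it assumes a BspEvent-like object whose
waitMsecs(int) returns true once the event fires, and the period, class, and helper names
are illustrative.

    import org.apache.hadoop.mapreduce.Mapper;

    // Minimal sketch of the idea only, not the literal patch. Assumes a
    // BspEvent-like object whose waitMsecs(int) returns true once the
    // event fires; names and the 10-second period are illustrative.
    public class ProgressableWaiter {
      private static final int PROGRESS_PERIOD_MSECS = 10 * 1000;

      /** Wait up to msecs for the event, reporting progress each period. */
      public static boolean waitMsecs(BspEvent event, int msecs,
          Mapper<?, ?, ?, ?>.Context context) {
        long deadline = System.currentTimeMillis() + msecs;
        while (true) {
          long remaining = deadline - System.currentTimeMillis();
          if (remaining <= 0) {
            return false;                 // timed out; event never fired
          }
          int slice = (int) Math.min(remaining, PROGRESS_PERIOD_MSECS);
          if (event.waitMsecs(slice)) {
            return true;                  // event fired within this slice
          }
          context.progress();             // ping Hadoop; task stays alive
        }
      }

      /** waitForever() becomes the same loop, minus the deadline. */
      public static void waitForever(BspEvent event,
          Mapper<?, ?, ?, ?>.Context context) {
        while (!event.waitMsecs(PROGRESS_PERIOD_MSECS)) {
          context.progress();
        }
      }
    }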

In other news, while stress testing this and scaling it up, I think I might have found another
place progress needs to be called more often: in the Netty channel pipelines handling send
and receive during the input superstep, as collections of vertices are sent to their future
homes. I will try to get more instrumented runs in this morning if I can to get more details,
but something weird is going on when a worker is not reading a split but does start to receive
its partition data over Netty, and it is causing a timeout. I don't know if that timing is
coincidental, but a strange timeout during large-scale runs is happening consistently on such
worker nodes. Often, when I can get log data on such a timeout, it is not a healthy worker
timing out but one where Netty is overwhelmed and the worker has genuinely died. This might
be more appropriate in another JIRA, or perhaps Avery is already aware of this and has wrapped
it up into his next Netty improvement? Either way, I will try to get more details on what is
happening here and reproduce the problem. This is running on today's trunk too, so the
GIRAPH-300 improvements were already in when this problem showed up.
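
If the receive path really does need it, the fix would presumably look something like the
following. This is a purely hypothetical sketch against the old Netty 3.x API; the handler
and field names are mine, not Giraph's actual pipeline code.

    import org.apache.hadoop.util.Progressable;
    import org.jboss.netty.channel.ChannelHandlerContext;
    import org.jboss.netty.channel.MessageEvent;
    import org.jboss.netty.channel.SimpleChannelUpstreamHandler;

    // Hypothetical sketch only; names are not Giraph's actual pipeline
    // code. The idea: report progress from the receive path while
    // partition data streams in during the input superstep, so a busy
    // but healthy worker is not killed by the task tracker.
    public class ProgressReportingHandler extends SimpleChannelUpstreamHandler {
      private static final long PROGRESS_PERIOD_MSECS = 10 * 1000;
      private final Progressable progressable;  // wraps context.progress()
      private long lastProgressMsecs = System.currentTimeMillis();

      public ProgressReportingHandler(Progressable progressable) {
        this.progressable = progressable;
      }

      @Override
      public void messageReceived(ChannelHandlerContext ctx, MessageEvent e)
          throws Exception {
        long now = System.currentTimeMillis();
        if (now - lastProgressMsecs >= PROGRESS_PERIOD_MSECS) {
          progressable.progress();   // tell the task tracker we are alive
          lastProgressMsecs = now;
        }
        ctx.sendUpstream(e);         // hand the message to the next handler
      }
    }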

                
> Periodic worker calls to context.progress() will prevent timeout on some Hadoop clusters during barrier waits
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-246
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-246
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>            Priority: Minor
>              Labels: hadoop, patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-246-10.patch, GIRAPH-246-11.patch, GIRAPH-246-1.patch, GIRAPH-246-2.patch,
>                      GIRAPH-246-3.patch, GIRAPH-246-4.patch, GIRAPH-246-5.patch, GIRAPH-246-6.patch,
>                      GIRAPH-246-7.patch, GIRAPH-246-7_rebase1.patch, GIRAPH-246-7_rebase2.patch,
>                      GIRAPH-246-8.patch, GIRAPH-246-9.patch, GIRAPH-246-NEW-FIX-2.patch, GIRAPH-246-NEW-FIX.patch
>
>
> This simple change creates a command-line configurable option in GiraphJob to control
> the time between calls to context.progress(), allowing workers to avoid timeouts during
> long data load-ins in which some workers complete their input split reads much faster than
> others, or finish a superstep faster. I found this allowed jobs that were large-scale but
> with low memory overhead to complete even when they would previously time out during runs
> on a Hadoop cluster. Timeout is still possible when a worker crashes, runs out of memory,
> or has other legitimate GC or RPC trouble, but this change prevents unintentional crashes
> when the worker is actually still healthy.
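
For anyone wanting to try the knob, usage would look roughly like the sketch below. The
property name "giraph.waitProgressPeriodMsecs" is illustrative only; check the patch for
the real configuration key.

    import org.apache.giraph.graph.GiraphJob;

    // Hypothetical usage sketch: "giraph.waitProgressPeriodMsecs" is an
    // illustrative property name, not necessarily the key the patch defines.
    public class SubmitExample {
      public static void main(String[] args) throws Exception {
        GiraphJob job = new GiraphJob("progress-example");
        job.getConfiguration().setInt("giraph.waitProgressPeriodMsecs", 10000);
        // ... set vertex class, input/output formats, worker counts ...
        System.exit(job.run(true) ? 0 : -1);
      }
    }

Since it is just a Hadoop Configuration property, it could equally be passed on the command
line, e.g. as -D giraph.waitProgressPeriodMsecs=10000, when the job driver goes through
ToolRunner.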

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
