giraph-dev mailing list archives

From "Eli Reisman (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (GIRAPH-246) Periodic worker calls to context.progress() will prevent timeout on some Hadoop clusters during barrier waits
Date Fri, 17 Aug 2012 19:03:38 GMT

    [ https://issues.apache.org/jira/browse/GIRAPH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436974#comment-13436974 ]

Eli Reisman commented on GIRAPH-246:
------------------------------------

I have run several more large-scale tests and can now confirm that something else is wrong.
I still believe 246-NEW-FIX-2.patch keeps the calls in the predicate lock as Avery prefers
and performs as well as the old patch did. It solves several of the easy timeout cases and
should probably go in so we can get past the current situation, which is failing at various
points during runs of different sizes.
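
For anyone following along, this is roughly the shape of the placement described above: the
progress() call lives inside the time-bounded wait on the predicate lock. The names here are
mine for illustration (not the actual patch code), and Progressable stands in for the mapper
context:

    import org.apache.hadoop.util.Progressable;

    public class BarrierWaitSketch {
      private static final long WAIT_MS = 30000L;   // illustrative interval
      private final Object barrierLock = new Object();
      private boolean barrierDone = false;          // the wait predicate

      public void awaitBarrier(Progressable context)
          throws InterruptedException {
        synchronized (barrierLock) {
          while (!barrierDone) {
            // Time-bounded wait, so the worker wakes periodically even
            // when the barrier has not been signaled yet.
            barrierLock.wait(WAIT_MS);
            // Report progress while still holding the lock.
            context.progress();
          }
        }
      }

      public void releaseBarrier() {
        synchronized (barrierLock) {
          barrierDone = true;
          barrierLock.notifyAll();
        }
      }
    }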

The bad news: there is a new problem that is affecting all the patches, even the rebases.
If a worker does not read an input split and passes 600 seconds of "idle time" during
INPUT_SUPERSTEP, it will time out. The original patch solved this issue for a month or more,
but now none of them solve it.

There is a lot of newly committed code; perhaps something in it has had this side effect.
Whether the progress calls come from inside the locking code or from BspServiceWorker, the
failure occurs. The length of the timeout during the INPUT_SUPERSTEP barrier wait does not
affect the situation either. Something else is going on here now. I have instrumented the
code to confirm that the idle workers ARE waking up from the barrier and scanning the list
again before going back to sleep (though not far enough to claim a split, which is what
GIRAPH-301 is attempting to address), but if they don't end up executing split-reading code
before the timeout arrives, these periodic wake-ups are not enough to avoid timeouts. This
is independent of how many progress() calls occur during those wake-ups.

Because the Netty channel pipeline runs in its own thread pool, as long as BspServiceWorker
is calling progress() often enough, the lack of calls in the Netty code should not make any
difference. It now seems to me that progress calls from inside BspServiceWorker are not
making it out. I notice the master has no problem with barrier waits longer than the ones
these idle workers experience, even when it is effectively idle and waiting.
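
One way to check whether the calls are actually being issued (and from which thread) is a
counting wrapper along these lines; this is an illustrative sketch, not code from any of the
patches:

    import java.util.concurrent.atomic.AtomicLong;
    import org.apache.hadoop.util.Progressable;

    // Wraps the real context so every progress() call is counted and
    // logged before being passed through to the underlying reporter.
    public class CountingProgressable implements Progressable {
      private final Progressable delegate;
      private final AtomicLong calls = new AtomicLong();

      public CountingProgressable(Progressable delegate) {
        this.delegate = delegate;
      }

      public void progress() {
        long n = calls.incrementAndGet();
        System.err.println("progress() call #" + n + " on thread "
            + Thread.currentThread().getName());
        delegate.progress();
      }
    }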

This whole issue has defied simple explanation for a long time now, and I was just happy to
have a fix that seemed to solve it. But at this point I would really like to get to the
bottom of it, because it will once again directly block us here. I realize others who have
seen this issue are encountering problems at other phases of the job workflow, but if anyone
has ideas I'd like to hear them. Having every worker waste a thread calling progress() all
the time seems like the wrong direction, since we share our grid with lots of concurrent
jobs and threads are at a premium here. It seems like there should be a simple reason for
this problem cropping up. Manually raising the Hadoop timeouts is also not an option for us.
So a Giraph-internal solution that lets bad jobs die and keeps good ones alive is really
what we need.
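
To make that trade-off concrete, the thread-per-worker approach I'm arguing against would
look roughly like the sketch below (names invented). Because it reports progress
unconditionally, a genuinely hung worker stays alive past the timeout too, which is exactly
what we don't want:

    import org.apache.hadoop.util.Progressable;

    // Sketch of the rejected alternative: a daemon thread that calls
    // progress() on a fixed interval regardless of worker state.
    public class ProgressHeartbeat {
      public static Thread start(final Progressable context,
                                 final long intervalMs) {
        Thread t = new Thread(new Runnable() {
          public void run() {
            try {
              while (true) {
                context.progress();   // keeps the task alive even if hung
                Thread.sleep(intervalMs);
              }
            } catch (InterruptedException e) {
              // interrupted at shutdown; exit quietly
            }
          }
        });
        t.setDaemon(true);
        t.start();
        return t;
      }
    }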

If anyone has further input on this topic, please feel free to jump in; I'd like to get this
solved for all cases and move forward!

> Periodic worker calls to context.progress() will prevent timeout on some Hadoop clusters during barrier waits
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-246
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-246
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>            Priority: Minor
>              Labels: hadoop, patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-246-10.patch, GIRAPH-246-11.patch, GIRAPH-246-1.patch, GIRAPH-246-2.patch,
> GIRAPH-246-3.patch, GIRAPH-246-4.patch, GIRAPH-246-5.patch, GIRAPH-246-6.patch, GIRAPH-246-7.patch,
> GIRAPH-246-7_rebase1.patch, GIRAPH-246-7_rebase2.patch, GIRAPH-246-8.patch, GIRAPH-246-9.patch,
> GIRAPH-246-NEW-FIX-2.patch, GIRAPH-246-NEW-FIX.patch
>
>
> This simple change creates a command-line configurable option in GiraphJob to control
> the time between calls to context().progress(), which allows workers to avoid timeouts
> during long data load-ins in which some workers complete their input split reads much
> faster than others, or finish a superstep faster. I found this allowed jobs that were
> large-scale but with low memory overhead to complete even when they would previously time
> out during runs on a Hadoop cluster. Timeout is still possible when the worker crashes,
> runs out of memory, or has other legitimate GC or RPC trouble, but this prevents
> unintentional crashes when the worker is actually still healthy.
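
Below is a rough sketch of the kind of configurable knob the description above refers to;
the key name and default here are invented for illustration and may not match the option the
patch actually adds:

    import org.apache.hadoop.conf.Configuration;

    public class ProgressIntervalSketch {
      // Hypothetical key and default, for illustration only.
      public static final String PROGRESS_INTERVAL_KEY =
          "giraph.progressIntervalMsec";
      public static final long DEFAULT_PROGRESS_INTERVAL_MS = 30000L;

      public static long getProgressInterval(Configuration conf) {
        return conf.getLong(PROGRESS_INTERVAL_KEY,
            DEFAULT_PROGRESS_INTERVAL_MS);
      }
    }

With something like this in place, the interval could be set per job on the command line
using Hadoop's generic -D option, e.g. -Dgiraph.progressIntervalMsec=60000 (again, an
invented key).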

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
