giraph-dev mailing list archives

From "Eli Reisman (JIRA)" <>
Subject [jira] [Commented] (GIRAPH-246) Periodic worker calls to context.progress() will prevent timeout on some Hadoop clusters during barrier waits
Date Thu, 09 Aug 2012 16:13:19 GMT


Eli Reisman commented on GIRAPH-246:

The version of Hadoop we test on is not without its own patches. I agree with both of you.
This whole thing makes me uncomfortable. I would be most comfortable with a fix that we understood
the workings of.

The discussion I had with Claudio and Maya about the disk spill code was the first time I had
any impression that not everyone was attempting to scale this out. I had assumed everyone was,
given that we ride on Hadoop, BSP has "Bulk" in it, etc. After that discussion, I realized that
perhaps others were not seeing the stuff that was bothering me because we weren't all testing
at the same scale of hardware. Since then, whenever this issue has come up, I have assumed no
one was verifying this stuff but me.

I am not comfortable being the single source of ground truth on this, especially if others
feel strongly about it. All along, I assumed there would come a point where someone would
stop arguing and just run some large jobs on some large data, so that at least one other
person could verify whether one of the other fixes can be made to work or not.

Right now, and for the immediate future, none of us are able to schedule large jobs to do
this (I have been trying, believe me). I do have a rebased patch that works for us. We
have been using it for a while now with success. If you want to wait and put this in later
when others encounter the same problem, or if someone is being proactive about fixing the
problem soon, I'm totally fine with that.

My only horse in this race is getting jobs to run. In the short term, I have a way to do that
with the 246-11 patch or the other rebase. So, barring some alternate ground truth coming into
the picture and settling this, I'm fine handling this however we like, as long as it's a
short-term window.

> Periodic worker calls to context.progress() will prevent timeout on some Hadoop clusters during barrier waits
> -------------------------------------------------------------------------------------------------------------
>                 Key: GIRAPH-246
>                 URL:
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>            Priority: Minor
>              Labels: hadoop, patch
>             Fix For: 0.2.0
>         Attachments: GIRAPH-246-1.patch, GIRAPH-246-10.patch, GIRAPH-246-11.patch, GIRAPH-246-2.patch,
>                      GIRAPH-246-3.patch, GIRAPH-246-4.patch, GIRAPH-246-5.patch, GIRAPH-246-6.patch,
>                      GIRAPH-246-7.patch, GIRAPH-246-8.patch, GIRAPH-246-9.patch
>
> This simple change adds a command-line configurable option in GiraphJob that controls the
> time between calls to context.progress(). This lets workers avoid timeouts during long data
> load-ins in which some workers complete their input split reads much faster than others, or
> finish a superstep faster. I found this allowed jobs that were large-scale but with low
> memory overhead to complete even when they would previously time out during runs on a Hadoop
> cluster. A timeout is still possible when a worker crashes, runs out of memory, or has other
> legitimate GC or RPC trouble, but the change prevents unintentional crashes when the worker
> is actually still healthy.
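
The idea in the issue description can be illustrated with a minimal, self-contained sketch.
This is not the code from any of the attached patches. The class name BarrierProgressSketch,
the method awaitWithProgress, and the use of a CountDownLatch to stand in for the barrier
are all assumptions made for illustration, and the Runnable stands in for whatever
would call context.progress() inside a real Hadoop task. The interval parameter plays
the role of the configurable time between progress calls.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration only; not Giraph's actual implementation. */
final class BarrierProgressSketch {

  /**
   * Waits for the barrier to open, reporting progress each time the
   * interval elapses without the barrier being released.
   *
   * @return the number of progress reports made while waiting
   */
  static int awaitWithProgress(CountDownLatch barrier, Runnable progress, long intervalMs) {
    int reports = 0;
    try {
      // await() returns false if the interval elapsed before the latch reached zero.
      while (!barrier.await(intervalMs, TimeUnit.MILLISECONDS)) {
        // Assumption: in a real task this would be a call to context.progress().
        progress.run();
        reports++;
      }
    } catch (InterruptedException e) {
      // Restore the interrupt flag and surface the failure to the caller.
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting on barrier", e);
    }
    return reports;
  }

  public static void main(String[] args) {
    CountDownLatch barrier = new CountDownLatch(1);

    // Simulate a slower peer that reaches the barrier after roughly 350 ms.
    Thread slowPeer = new Thread(() -> {
      try {
        Thread.sleep(350);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      barrier.countDown();
    });
    slowPeer.start();

    int reports = awaitWithProgress(barrier, () -> System.out.println("progress"), 100L);
    System.out.println("barrier released after " + reports + " progress report(s)");
  }
}
```

Note that a worker that has genuinely crashed or hung would stop executing this loop as
well, so it would stop reporting progress and could still be timed out, which matches the
behavior the description aims for.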
