giraph-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Eli Reisman (JIRA)" <>
Subject [jira] [Updated] (GIRAPH-301) InputSplit Reservations are clumping, leaving many workers asleep while other process too many splits and get overloaded.
Date Mon, 20 Aug 2012 01:23:37 GMT


Eli Reisman updated GIRAPH-301:

    Attachment: GIRAPH-301-6.patch

This patch has been well tested up to the largest scales we use here, and functions even better
than 306-5. I feel I should explain it as its methods might be a bit controversial.

A large data load under our constraints here took a certain 4 figure number of workers, and
over 75 minutes without locality. With locality, this was reduced to 20 minutes and hundreds
less workers required. With 301-5, this is lowered to 400 less workers and 15 minutes. Using
this patch, this same data load in takes under 4 minutes, and 400 less workers than the original

The controversial part is this: as stated in earlier posts on this thread, instrumented runs
while experimenting with scale this weekend have revealed that even speeding up data load
in using locality and other changes in the 301 patches does not end the INPUT_SUPERSTEP as
soon as it could, or completely eliminate the "clumping" effect described above.

The reason for this clumping turns out to be, that while ZK can handle large read throughput,
the quorum must sync after writes before servicing a backup of many concurrent reads. Since
both the FINISHED and RESERVED znode lists are being queries in all iterations on every worker,
and also being mutated as splits are claimed and completed, the workers that never get a split
are not sleeping throughout the input step, but in fact very, VERY slowly iterating their
input split list. In some cases, the step ends before they have finished one single iteration,
even if the input superstep goes on for 30 or more minutes!

This patch (301-6) dramatically speeds this up by removing the checks for FINSIHED znodes.
The nodes are still created whenever a split is finished by a worker, so that the master knows
when to end the barrier and begin the first calculation superstep. There is no danger of BSP
barriers being tampered with. Further, every worker must read the whole list of splits at
least once from the top and register every node as RESERVED before it stops trying to read
any additional splits. Therefore, if a worker dies in mid-read, its ephemeral RESERVED node
disappears, and others could possibly claim it, since every node must still do one full iteration
on the list finding all splits RESERVED before ending its search for good and waiting on the
barrier for superstep 0.

This means that the only danger of data loss would be if the very last worker to iterate fails
during a split read. In this case, the next superstep will never come (as the split is never
marked FINISHED) and the job fails anyway. If a worker dies in Giraph after marking a split
FINISHED there is currently no algorithm in place to restore order to the calculation, even
if the worker could restart and recover, so no harm done on this common failure by the changes

In actual fact, the real story is any worker failing and restarting during the INPUT_SUPERSTEP
currently causes cascading failure to the job. Until we have a more comprehensive plan for
worker failure of this sort, there is no danger whatsoever to this large optimization in the
network load and speed during input superstep that comes by having the workers evaluate whether
to keep iterating on the input list based on every split being RESERVED rather than FINISHED.
I have added comments to BspServiceWorker#reserveInputSplit() where the changes are coded
to annotate that, should the recovery story for Giraph change in the future, this algorithm
optimization should be revisited.

Again, I have run this to happy completion many times today and can vouch that it causes no
problems for Giraph as-is. If everyone is comfortable with this change, I think the reduced
cost to network (literally cuts ZK reads from all workers during input phase in half) and
the reduced time to finish the superstep are well worth it.

> InputSplit Reservations are clumping, leaving many workers asleep while other process
too many splits and get overloaded.
> -------------------------------------------------------------------------------------------------------------------------
>                 Key: GIRAPH-301
>                 URL:
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp, graph, zookeeper
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>              Labels: patch
>             Fix For: 0.2.0
>         Attachments: GIRAPH-301-1.patch, GIRAPH-301-2.patch, GIRAPH-301-3.patch, GIRAPH-301-4.patch,
GIRAPH-301-5.patch, GIRAPH-301-6.patch
> With recent additions to the codebase, users here have noticed many workers are able
to load input splits extremely quickly, and this has altered the behavior of Giraph during
INPUT_SUPERSTEP when using the current algorithm for split reservations. A few workers process
multiple splits (often overwhelming Netty and getting GC errors as they attempt to offload
too much data too quick) while many (often most) of the others just sleep through the superstep,
never successfully participating at all.
> Essentially, the current algo is:
> 1. scan input split list, skipping nodes that are marked "Finsihed"
> 2. grab the first unfinished node in the list (reserved or not) and check its reserved
> 3. if not reserved, attempt to reserve & return it if successful.
> 4. if the first one you check is already taken, sleep for way too long and only wake
up if another worker finishes a split, then contend with that worker for another split, while
the majority of the split list might sit idle, not actually checked or claimed by anyone yet.
> This does not work. By making a few simple changes (and acknowledging that ZK reads are
cheap, only writes are not) this patch is able to get every worker involved, and keep them
in the game, ensuring that the INPUT_SUPERSTEP passes quickly and painlessly, and without
overwhelming Netty by spreading the memory load the split readers bear more evenly. If the
giraph.splitmb and -w options are set correctly, behavior is now exactly as one would expect
it to be.
> This also results in INPUT_SUPERSTEP passing more quickly, and survive the INPUT_SUPERSTEP
for a given data load on less Hadoop memory slots.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:!default.jspa
For more information on JIRA, see:


View raw message