giraph-dev mailing list archives

From "Eli Reisman (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (GIRAPH-301) InputSplit Reservations are clumping, leaving many workers asleep while others process too many splits and get overloaded.
Date Wed, 15 Aug 2012 16:48:38 GMT

    [ https://issues.apache.org/jira/browse/GIRAPH-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435284#comment-13435284 ]

Eli Reisman commented on GIRAPH-301:
------------------------------------

Hey Alessandro,

Yes, I think on point 4 that was the idea, and it makes sense. The problem is, if you only check
one possibly available node before deciding to sleep, and then start reading the list again from
index 0 of the split list, you contend with other workers whenever one of them finishes a split
and wakes you up to keep iterating. Whoever gets the first open slot wins it; the others fail to
claim it and go back to sleep instead of continuing to iterate.

Worse, when the splits hold big data and take a long time to read, other workers time out
every minute or so and jump back in to attempt to claim a split. The awakened workers iterate
again, and since many other workers are already reading big splits, the first entry they
encounter is a RESERVED split; they fail to claim it, and back to sleep they go. So you're
back to this problem of everyone (including workers who finish a split and try to iterate
for a new one) going back to sleep far too eagerly. I have seen this behavior no matter how
I set splitmb and -w since I started using Giraph, and I have been puzzled why I couldn't
trick some (often many) workers into doing something when there was enough work to go around.
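
In case it helps to see the difference concretely, here is a rough sketch (plain Java, not the
actual Giraph code; SplitState, reserve(), and the in-memory list are made-up stand-ins for the
real ZooKeeper-backed bookkeeping) of the "check one entry, then sleep" loop versus scanning the
whole list before giving up:

    import java.util.List;

    class ReservationSketch {
      enum SplitState { UNRESERVED, RESERVED, FINISHED }

      // Old behavior: bail out and sleep after the first non-FINISHED entry,
      // even though unclaimed splits may sit further down the list.
      static int pickSplitOld(List<SplitState> splits) {
        for (int i = 0; i < splits.size(); i++) {
          if (splits.get(i) == SplitState.FINISHED) {
            continue;
          }
          if (splits.get(i) == SplitState.UNRESERVED && reserve(i)) {
            return i;    // claimed a split
          }
          return -1;     // first live entry was taken (or we lost the race): sleep
        }
        return -1;       // everything is finished
      }

      // Scan the whole list before sleeping, so a worker only sleeps when
      // every remaining split really is claimed by someone else.
      static int pickSplitNew(List<SplitState> splits) {
        for (int i = 0; i < splits.size(); i++) {
          if (splits.get(i) == SplitState.UNRESERVED && reserve(i)) {
            return i;
          }
        }
        return -1;
      }

      // Stand-in for the real claim attempt, which is a ZooKeeper znode write.
      static boolean reserve(int splitIndex) {
        return true;
      }
    }

The second loop only goes back to sleep once every remaining split really is spoken for, which
is the behavior the patch is aiming at.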

Users here started emailing about this clumping effect, and I had noticed it many times over
the last few months. The situation I describe above is with the new locality patch making
some workers read very fast (and overload trying to send out all the data as they pick up
new splits like crazy), but this clumping of split-reading activity and groups of workers
sleeping through the whole input phase has been happening as long as I've been using Giraph.

My cluster is down this morning for upgrades, but I hope to be back up and running this
afternoon/tonight. The tests of this I ran before putting the patch up worked well: I could
get just the behavior I had always expected by doing

 (# of MB of data) / (giraph.splitmb) == (# of workers you should see busy right away reading
splits, if you select that many or more with -w) 

Which is 1 split per worker right from the get-go. Other ratios in that formula work out the way
one would expect when skewing in favor of extra splits or extra workers (i.e. no clumping with
50 workers and 100 splits: almost all of them read 2 splits, not some reading 3-4 and some
reading 0 like before).
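
To make the arithmetic concrete (the input size below is an invented number for illustration,
not from a real run), here is the same formula written out:

    // Back-of-the-envelope version of the formula above; the 10000 MB input
    // size is hypothetical.
    public class SplitMath {
      public static void main(String[] args) {
        int inputMb = 10000;   // total MB of input data (made up)
        int splitMb = 100;     // giraph.splitmb
        int workers = 50;      // -w
        int splits = inputMb / splitMb;                  // 100 splits
        double perWorker = (double) splits / workers;    // ~2 splits per worker
        System.out.println(splits + " splits, ~" + perWorker + " per worker");
      }
    }

With -w 100 instead, the same numbers give exactly 1 split per worker from the start, which is
the no-clumping behavior described above.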

So it comes down to your first point: is it bad to load up the ZooKeeper quorum with potentially
many reads like this? After reading both ZK papers and having this problem to think about when
I added the locality patch, my opinion is "no"; this is what ZK is absolutely designed for.
Having a quorum of ZKs to split the read requests definitely helps, but on most clusters
this is a minimum of 3 servers. This does bear more testing, of course.

The time when slowing or problems can happen is during writes. This patch tries to at least
mitigate that a bit by not bothering to create the claim node unless we have a hint
that the node is not already created. This will not be useful on the first pass when everyone
is vying for nodes, but after any awakening from sleep it is quite likely to help since, as of
the locality patch, many workers' split lists are no longer ordered the same way and they may
not encounter the same unclaimed nodes right away as they iterate.
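
For anyone who wants the shape of that check without digging through the patch, it is basically
a cheap exists-style read before the create; a minimal sketch against the plain ZooKeeper client
(the path layout and method name are illustrative, not the ones in GIRAPH-301-1.patch):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    class ClaimSketch {
      /** Returns true if this worker claimed the split's "reserved" znode. */
      static boolean tryClaim(ZooKeeper zk, String splitPath)
          throws KeeperException, InterruptedException {
        String reservedPath = splitPath + "/reserved";
        // Cheap read: if someone already holds the reservation, skip the write.
        if (zk.exists(reservedPath, false) != null) {
          return false;
        }
        try {
          // The write itself; another worker may still beat us to it.
          // EPHEMERAL here so a dead worker's claim goes away on its own; the
          // patch may well handle this differently.
          zk.create(reservedPath, new byte[0],
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
          return true;
        } catch (KeeperException.NodeExistsException e) {
          return false;   // lost the race; treat the split as reserved
        }
      }
    }

The exists() call is only a read; the create() is the expensive write, and the
NodeExistsException catch still covers the race where two workers get past the read at the
same moment.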

                
> InputSplit Reservations are clumping, leaving many workers asleep while others process
too many splits and get overloaded.
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-301
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-301
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp, graph, zookeeper
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>              Labels: patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-301-1.patch
>
>
> With recent additions to the codebase, users here have noticed many workers are able
to load input splits extremely quickly, and this has altered the behavior of Giraph during
INPUT_SUPERSTEP when using the current algorithm for split reservations. A few workers process
multiple splits (often overwhelming Netty and getting GC errors as they attempt to offload
too much data too quickly) while many (often most) of the others just sleep through the superstep,
never successfully participating at all.
> Essentially, the current algo is:
> 1. scan input split list, skipping nodes that are marked "Finished"
> 2. grab the first unfinished node in the list (reserved or not) and check its reserved
status.
> 3. if not reserved, attempt to reserve & return it if successful.
> 4. if the first one you check is already taken, sleep for way too long and only wake
up if another worker finishes a split, then contend with that worker for another split, while
the majority of the split list might sit idle, not actually checked or claimed by anyone yet.
> This does not work. By making a few simple changes (and acknowledging that ZK reads are
cheap, only writes are not) this patch is able to get every worker involved and keep them
in the game, ensuring that the INPUT_SUPERSTEP passes quickly and painlessly and without
overwhelming Netty, since the memory load the split readers bear is spread more evenly. If the
giraph.splitmb and -w options are set correctly, behavior is now exactly as one would expect
it to be.
> This also results in INPUT_SUPERSTEP passing more quickly, and lets a job survive the
INPUT_SUPERSTEP for a given data load on fewer Hadoop memory slots.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
