giraph-dev mailing list archives

From "Eli Reisman (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (GIRAPH-730) GiraphApplicationMaster race condition in resource loading
Date Sun, 01 Sep 2013 06:30:53 GMT

    [ https://issues.apache.org/jira/browse/GIRAPH-730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13755667#comment-13755667 ]

Eli Reisman commented on GIRAPH-730:
------------------------------------

My question is this: the AM runs as a single thread and requests all the containers it needs
in one lump. In my tests, the callback (from the RM handing the local AM all the containers
it asked for) returned the whole bunch of containers at once, even though the call itself is
made asynchronously.

However, once the callback produced all the requested containers (always in a single asynchronous
callback), the same single AM thread iterated through the collection of containers, one at a
time, populating each with metadata and the resource map in buildContainerLaunchContext. So
there was no concurrency issue.
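To make that concrete, here is a minimal plain-Java sketch of the single-callback case (the class and method names are simplified stand-ins, not the actual GiraphApplicationMaster code): one callback delivers every container, and the same thread walks the list, so no two calls to the context-building method ever overlap.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for the AM: the RM delivers all containers in one
// callback, and one thread builds each launch context sequentially.
public class SingleCallbackSketch {
    private final Map<Integer, String> launchContexts = new HashMap<>();

    // Stand-in for buildContainerLaunchContext(): populate per-container
    // metadata and the resource map.
    private void buildContainerLaunchContext(int containerId) {
        launchContexts.put(containerId, "context-for-" + containerId);
    }

    // Stand-in for the RM's asynchronous "containers allocated" callback.
    public void onContainersAllocated(List<Integer> containers) {
        for (int id : containers) {          // one thread, one container at a time
            buildContainerLaunchContext(id);
        }
    }

    public static void main(String[] args) {
        SingleCallbackSketch am = new SingleCallbackSketch();
        List<Integer> allocated = new ArrayList<>();
        for (int i = 0; i < 500; i++) {
            allocated.add(i);
        }
        am.onContainersAllocated(allocated); // all 500 arrive in one callback
        System.out.println(am.launchContexts.size()); // 500
    }
}
```

As long as the RM returns everything in a single callback, this is effectively sequential code and cannot race with itself.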

BUT, I think now that you are running on a larger cluster and asking for more containers,
they are being returned in smaller groups. Perhaps you ask for 500 workers and instead get
two asynchronous callbacks from the RM, one with 200 containers and one with 300, and both
of _those_ asynchronous calls are now racing into buildContainerLaunchContext (etc.), and
this is where the concurrency issue arises?

YARN certainly does not guarantee that you get back all the containers you ask for in a single
callback, although in my tests I never saw any other behavior. If you are seeing this problem
at your scale, we need to address it. Good catch!

If this is what is happening (you have logged one AM ask for X containers resulting in more
than one asynchronous callback returning A, B, and C containers, where A + B + C = X), then
we need to fix this.

But I do think we should not risk going with a partial solution. If what you're describing
and what I am describing above match up, we really should just eliminate this risk now by
protecting buildContainerLaunchContext, or concurrent attempts to populate the container
launch contexts with IDs etc. could overwrite each other, and containers could be lost on
the AM side this way.

It doesn't mean we have to just slap a "synchronized" block onto buildContainerLaunchContext;
maybe something more subtle could work. But we should probably address the problem so that
all the risk is gone.
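One "more subtle" option, sketched below in plain Java with hypothetical names (again, not the actual AM code): rather than synchronizing buildContainerLaunchContext itself, funnel every allocation batch through a single-threaded executor. RM callbacks only enqueue work, and one dedicated thread builds all the contexts, so the batches can never interleave no matter how many callbacks the RM fires.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch: serialize context building on one executor thread instead of
// locking buildContainerLaunchContext() directly.
public class SerializedLaunchSketch {
    private final ConcurrentHashMap<Integer, String> launchContexts =
        new ConcurrentHashMap<>();
    private final ExecutorService launchThread =
        Executors.newSingleThreadExecutor();

    private void buildContainerLaunchContext(int containerId) {
        launchContexts.put(containerId, "context-for-" + containerId);
    }

    // May be called from any RM callback thread; the batch is handed off
    // to the single executor thread, so batches never interleave.
    public void onContainersAllocated(List<Integer> batch) {
        launchThread.submit(() -> {
            for (int id : batch) {
                buildContainerLaunchContext(id);
            }
        });
    }

    public static void main(String[] args) throws Exception {
        SerializedLaunchSketch am = new SerializedLaunchSketch();

        // Simulate the RM returning 500 containers in two async batches.
        List<Integer> first = new ArrayList<>();
        List<Integer> second = new ArrayList<>();
        for (int i = 0; i < 200; i++) first.add(i);
        for (int i = 200; i < 500; i++) second.add(i);

        Thread t1 = new Thread(() -> am.onContainersAllocated(first));
        Thread t2 = new Thread(() -> am.onContainersAllocated(second));
        t1.start();
        t2.start();
        t1.join();
        t2.join();

        am.launchThread.shutdown();
        am.launchThread.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(am.launchContexts.size()); // 500: no lost containers
    }
}
```

A plain synchronized block on the method would also close the race; the executor approach just keeps the callback threads fast and makes the "one builder thread" invariant explicit.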

What do you think? Maybe try another patch that addresses all possible race risks here? Also,
if the race you're seeing is not as I have described here, please let me know what the real
concern is, maybe I missed your point.

Thanks, great work!
                
> GiraphApplicationMaster race condition in resource loading
> ----------------------------------------------------------
>
>                 Key: GIRAPH-730
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-730
>             Project: Giraph
>          Issue Type: Bug
>    Affects Versions: 1.0.0
>         Environment: Giraph with Yarn
>            Reporter: Chuan Lei
>            Assignee: Chuan Lei
>         Attachments: GIRAPH-730.v1.patch
>
>
> In GiraphApplicationMaster.java, the getTaskResourceMap function is not thread-safe, which
> causes the application master to fail to distribute the resources (jar, configuration file,
> etc.) to each container.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
