hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2410) Nodemanager ShuffleHandler can possibly exhaust file descriptors
Date Tue, 08 Sep 2015 14:53:46 GMT

    [ https://issues.apache.org/jira/browse/YARN-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734931#comment-14734931 ]

Jason Lowe commented on YARN-2410:
----------------------------------

Thanks for updating the patch!

bq. The only reason was findbugs which does not allow more than 7 parameters in a function call
Normally a builder pattern is used to make the code more readable in those situations.  However,
I don't think we need more than 7 here.  ReduceContext really only needs mapIds, reduceId, channelCtx,
user, infoMap, and outputBasePathStr.  The other two parameters are either known to be zero
(and should not be passed) or can be derived from another parameter (the size of mapIds).  As such
we don't need SendMapOutputParams.
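
For reference, the usual shape of that builder would be something like the sketch below.  The field
names follow the ReduceContext list above; the types and everything else are illustrative, not taken
from the patch.
{noformat}
// Illustrative builder sketch only -- field names mirror the ReduceContext
// fields listed above; types (e.g. the info map values) are assumptions.
import java.util.List;
import java.util.Map;
import org.jboss.netty.channel.ChannelHandlerContext;

public class ReduceContext {
  private final List<String> mapIds;
  private final int reduceId;
  private final ChannelHandlerContext channelCtx;
  private final String user;
  private final Map<String, Object> infoMap;   // MapOutputInfo values in the real code
  private final String outputBasePathStr;

  private ReduceContext(Builder b) {
    this.mapIds = b.mapIds;
    this.reduceId = b.reduceId;
    this.channelCtx = b.channelCtx;
    this.user = b.user;
    this.infoMap = b.infoMap;
    this.outputBasePathStr = b.outputBasePathStr;
  }

  public static class Builder {
    private List<String> mapIds;
    private int reduceId;
    private ChannelHandlerContext channelCtx;
    private String user;
    private Map<String, Object> infoMap;
    private String outputBasePathStr;

    public Builder mapIds(List<String> v) { this.mapIds = v; return this; }
    public Builder reduceId(int v) { this.reduceId = v; return this; }
    public Builder channelCtx(ChannelHandlerContext v) { this.channelCtx = v; return this; }
    public Builder user(String v) { this.user = v; return this; }
    public Builder infoMap(Map<String, Object> v) { this.infoMap = v; return this; }
    public Builder outputBasePathStr(String v) { this.outputBasePathStr = v; return this; }

    public ReduceContext build() { return new ReduceContext(this); }
  }
}
{noformat}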

bq. The reduceContext is a variable that holds the value set by the setAttachment() method and
is used by the getAttachment() answer. If I declare it in the test method, it needs to be final,
which cannot be done due to it being used by the setter.
createMockChannel can simply have a ReduceContext parameter, marked final, and that should
solve that problem.  But I thought we were getting rid of the use of channel attachments and
just associating the context with the listener directly?

Related to the last comment, we're still using channel attachments.  sendMap can just take
a ReduceContext parameter, and the listener can provide its context when calling it.  No need
for channel attachments.
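
Concretely, something along these lines, as an inner class of Shuffle so sendMap(ReduceContext) is
in scope; the exact signatures are my assumption, not the patch:
{noformat}
  // Sketch: the listener owns its ReduceContext and hands it straight back to
  // sendMap, so nothing needs to be stashed on the channel.
  class ReduceMapFileCount implements ChannelFutureListener {
    private final ReduceContext reduceContext;

    ReduceMapFileCount(ReduceContext reduceContext) {
      this.reduceContext = reduceContext;
    }

    @Override
    public void operationComplete(ChannelFuture future) throws Exception {
      // ... existing bookkeeping for the completed map output ...
      sendMap(reduceContext);   // the context travels with the listener, not the channel
    }
  }
{noformat}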

This can NPE since we're checking for null after we already use it:
{noformat}
+            nextMap = sendMapOutput(
+                reduceContext.getSendMapOutputParams().getCtx(),
+                reduceContext.getSendMapOutputParams().getCtx().getChannel(),
+                reduceContext.getSendMapOutputParams().getUser(), mapId,
+                reduceContext.getSendMapOutputParams().getReduceId(), info);
+            nextMap.addListener(new ReduceMapFileCount(reduceContext));
+            if (null == nextMap) {
{noformat}
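
i.e. the null check has to move above the first use, keeping the patch's current accessors:
{noformat}
            nextMap = sendMapOutput(
                reduceContext.getSendMapOutputParams().getCtx(),
                reduceContext.getSendMapOutputParams().getCtx().getChannel(),
                reduceContext.getSendMapOutputParams().getUser(), mapId,
                reduceContext.getSendMapOutputParams().getReduceId(), info);
            if (null == nextMap) {
              // whatever handling the patch does for a missing output must run here,
              // before nextMap is dereferenced below
              return null;   // placeholder; match the enclosing method's contract
            }
            nextMap.addListener(new ReduceMapFileCount(reduceContext));
{noformat}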

maxSendMapCount should be cached during serviceInit like the other conf-derived settings so
we aren't doing conf lookups on every shuffle.
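
Something like the following, next to the other lookups; the constant and field names follow the
current patch, and the default constant is a placeholder:
{noformat}
  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    // ... existing conf-derived settings ...
    // Read once at service start instead of on every shuffle request.
    maxSendMapCount = conf.getInt(SHUFFLE_MAX_SEND_COUNT,
        DEFAULT_SHUFFLE_MAX_SEND_COUNT);
    super.serviceInit(conf);
  }
{noformat}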

The indentation in sendMap isn't correct: code after a conditional block is indented at the
same level as the contents of that block.  There are other places that are over-indented.

MockShuffleHandler only needs to override one thing, getShuffle, but the mock that method
returns has to override a bunch of stuff.  It makes more sense to create a separate class
for the mocked Shuffle than for the mocked ShuffleHandler.
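
In other words, roughly this shape (the Shuffle constructor signature here is an assumption):
{noformat}
  static class MockShuffleHandler extends ShuffleHandler {
    // The heavily-overridden behavior gets its own named class instead of a big
    // anonymous class; keeping it an inner class of the handler preserves access
    // to the enclosing ShuffleHandler instance.
    class MockShuffle extends Shuffle {
      MockShuffle(Configuration conf) {
        super(conf);
      }
      // override sendMapOutput() etc. here as the test requires
    }

    @Override
    protected Shuffle getShuffle(Configuration conf) {
      return new MockShuffle(conf);   // the only thing this mock has to do
    }
  }
{noformat}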

Should the mock Future stuff be part of creating the mocked channel?  Can simply pass the
listener list to use as an arg to the method that mocks up the channel.
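
For example, a sketch of how that helper could look; the mocking details are assumptions about the
test, not the patch:
{noformat}
  // The test supplies the listener list; the mocked write() future records every
  // listener added, so the test can fire completions itself later.
  private Channel createMockChannel(final List<ChannelFutureListener> listeners) {
    Channel channel = Mockito.mock(Channel.class);
    ChannelFuture future = Mockito.mock(ChannelFuture.class);
    Mockito.doAnswer(new Answer<Void>() {
      @Override
      public Void answer(InvocationOnMock invocation) {
        listeners.add((ChannelFutureListener) invocation.getArguments()[0]);
        return null;
      }
    }).when(future).addListener(Mockito.any(ChannelFutureListener.class));
    Mockito.when(channel.write(Mockito.any())).thenReturn(future);
    return channel;
  }
{noformat}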

Nit: SHUFFLE_MAX_SEND_COUNT should probably be something like SHUFFLE_MAX_SESSION_OPEN_FILES
to better match the property name.  Similarly maxSendMapCount could have a more appropriate
name.

Nit: Format for 80 columns

Nit: There are still instances where we have a class definition immediately after variable definitions
and a lack of whitespace between classes and methods or between methods.  Whitespace would
help readability in those places.

> Nodemanager ShuffleHandler can possibly exhaust file descriptors
> ----------------------------------------------------------------
>
>                 Key: YARN-2410
>                 URL: https://issues.apache.org/jira/browse/YARN-2410
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: Nathan Roberts
>            Assignee: Kuhu Shukla
>         Attachments: YARN-2410-v1.patch, YARN-2410-v2.patch, YARN-2410-v3.patch, YARN-2410-v4.patch,
> YARN-2410-v5.patch, YARN-2410-v6.patch
>
>
> The async nature of the ShuffleHandler can cause it to open a huge number of
> file descriptors; when it runs out, it crashes.
> Scenario:
> Job with 6K reduces, slow start set to 0.95, about 40 map outputs per node.
> Let's say all 6K reduces hit a node at about the same time asking for their
> outputs. Each reducer will ask for all 40 map outputs over a single socket in a
> single request (not necessarily all 40 at once, but with coalescing it is
> likely to be a large number).
> sendMapOutput() will open the file for random reading and then perform an async
> transfer of the particular portion of this file. This will theoretically
> happen 6000*40=240000 times, which will run the NM out of file descriptors and
> cause it to crash.
> The algorithm should be refactored a little to not open the fds until they're
> actually needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
