hadoop-mapreduce-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-5584) ShuffleHandler becomes unresponsive during gridmix runs and can leak file descriptors
Date Tue, 15 Oct 2013 15:36:43 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-5584:
----------------------------------

    Priority: Blocker  (was: Major)

The reducers were timing out while trying to fetch their map inputs from certain nodes.
Simple GET probes to the shuffle port on these nodes showed that they were completely
unresponsive.  Examination of the nodes showed that they had leaked a significant number
of file descriptors, with sockets stuck in the CLOSE_WAIT state.
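
A probe of the kind described above could look roughly like the following. This is only
a sketch, not the tool that was actually used: the class name ShufflePortProbe and the
method responds() are hypothetical, and the port in main() assumes the shuffle port is
left at what is believed to be its default (mapreduce.shuffle.port = 13562). It sends a
bare GET and reports whether any byte comes back before the timeout.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hadoop.
public class ShufflePortProbe {

    /**
     * Returns true if the server sends back at least one byte within
     * timeoutMs, false on timeout, refusal, or any other I/O error.
     */
    public static boolean responds(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            // Bound both the connect and the read so a hung server cannot hang the probe.
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            socket.setSoTimeout(timeoutMs);

            String request = "GET / HTTP/1.1\r\n"
                    + "Host: " + host + "\r\n"
                    + "Connection: close\r\n\r\n";
            OutputStream out = socket.getOutputStream();
            out.write(request.getBytes(StandardCharsets.US_ASCII));
            out.flush();

            // Any byte at all (even an HTTP error status) means a handler is alive.
            InputStream in = socket.getInputStream();
            return in.read() != -1;
        } catch (IOException e) {
            // Covers SocketTimeoutException and connection failures.
            return false;
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "localhost";
        // 13562 is an assumption; check mapreduce.shuffle.port on the cluster.
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 13562;
        System.out.println(host + ":" + port + " responds: " + responds(host, port, 5000));
    }
}
```

A node whose handlers are all blocked may still complete the TCP connect, so the read
timeout is the part that distinguishes a hung node from a healthy one.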

The jstacks of the NodeManager processes on these nodes also showed that all of the Netty
handlers were stuck somewhere in LocalDirAllocator.getLocalPathToRead.  They were either
blocked on the synchronized lock or waiting for fs.exists() to return, which forks and
execs {{stat}} since HADOOP-9652.
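
The contention pattern described above can be illustrated in isolation. The sketch below
is not the LocalDirAllocator code: SynchronizedLookupContention, lookup() and
runHandlers() are hypothetical names, and Thread.sleep() merely stands in for a slow
existence check such as a forked {{stat}}. It shows that when every handler thread must
pass through one synchronized method that does slow work, the handlers run one at a time
and total latency grows with the number of handlers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration, not Hadoop code.
public class SynchronizedLookupContention {
    private final long checkMillis;

    SynchronizedLookupContention(long checkMillis) {
        this.checkMillis = checkMillis;
    }

    // The lock is held for the full duration of the slow check,
    // so concurrent callers queue up behind each other.
    synchronized String lookup(String path) throws InterruptedException {
        Thread.sleep(checkMillis); // stand-in for a slow existence check
        return path;
    }

    /** Runs `handlers` concurrent lookups and returns the elapsed wall time in ms. */
    public static long runHandlers(int handlers, long checkMillis) throws Exception {
        SynchronizedLookupContention allocator = new SynchronizedLookupContention(checkMillis);
        ExecutorService pool = Executors.newFixedThreadPool(handlers);
        try {
            long start = System.nanoTime();
            List<Future<String>> results = new ArrayList<>();
            for (int i = 0; i < handlers; i++) {
                final String path = "map_" + i + ".out";
                results.add(pool.submit(() -> allocator.lookup(path)));
            }
            for (Future<String> result : results) {
                result.get(); // wait for every handler to finish
            }
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // With 8 handlers and a 100 ms check, expect roughly 8 x 100 ms, not 100 ms.
        System.out.println("elapsed ms: " + runHandlers(8, 100));
    }
}
```

If the slow check itself stalls, every other handler stalls behind it, which is
consistent with a shuffle port that stops answering altogether.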

> ShuffleHandler becomes unresponsive during gridmix runs and can leak file descriptors
> -------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5584
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5584
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Priority: Blocker
>
> While running gridmix on 2.3 we noticed that jobs are running much slower than normal.
> We tracked this down to reducers having difficulties shuffling data from maps.  Details
> to follow.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
