hadoop-mapreduce-issues mailing list archives

From "Haibo Chen (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (MAPREDUCE-6131) Integer overflow in RMContainerAllocator results in starvation of applications
Date Thu, 21 Jul 2016 03:46:20 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haibo Chen resolved MAPREDUCE-6131.
-----------------------------------
    Resolution: Invalid

> Integer overflow in RMContainerAllocator results in starvation of applications
> ------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6131
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6131
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Kamal Kc
>         Attachments: MAPREDUCE-6131-2.2.0.patch
>
>
> When processing large datasets, Hadoop can reach a state in which all
> containers run reduce tasks and no map tasks are scheduled. The
> application does not fail; it stays in this state without making
> any forward progress and has to be terminated manually.
> This bug is due to integer overflow in scheduleReduces() of
> RMContainerAllocator. The variable netScheduledMapMem overflows for
> large data sizes and takes a negative value, which results in a large
> finalReduceMemLimit and a large rampup value. In almost all cases, this
> large rampup value is greater than the total number of reduce tasks,
> so the AM tries to assign all reduce tasks. If the total number
> of reduce tasks is greater than the total number of container slots, all slots are
> taken up by reduce tasks, leaving none for maps.
> With a 128 MB block size and a 2 GB map container size, the overflow occurs at a data size of 128 TB.
> An example scenario for reproduction:
> - Input data size = 32 TB, block size = 128 MB, map container size = 10 GB,
>   reduce container size = 10 GB, #reducers = 50, cluster memory capacity = 7 x 40 GB, slowstart = 0.0
> A better resolution might be to change the variables used in
> RMContainerAllocator from int to long. A simpler fix would be to
> change only the local variables of scheduleReduces() to long.
> A patch for 2.2.0 is attached.
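
The arithmetic behind the overflow described above can be checked with a small Java sketch. It assumes one map task per input block and container memory counted in MB. The class and method names are hypothetical and are not taken from RMContainerAllocator or the attached patch.

```java
// Illustrative sketch only: names are hypothetical, not Hadoop code.
class MapMemOverflowDemo {

    // Number of map tasks, assuming one map task per input block (rounded up).
    static long mapCount(long inputBytes, long blockBytes) {
        return (inputBytes + blockBytes - 1) / blockBytes;
    }

    // Total map memory in MB computed with 32-bit arithmetic.
    // Java int multiplication wraps silently when the result exceeds 2^31 - 1.
    static int memAsInt(int maps, int containerMb) {
        return maps * containerMb;
    }

    // Same product computed with 64-bit arithmetic: widen before multiplying.
    static long memAsLong(int maps, int containerMb) {
        return (long) maps * containerMb;
    }

    public static void main(String[] args) {
        final long MB = 1L << 20;
        final long TB = 1L << 40;

        // Reproduction scenario from the report: 32 TB input, 128 MB blocks, 10 GB map containers.
        int maps = (int) mapCount(32 * TB, 128 * MB);
        int containerMb = 10 * 1024;
        System.out.println("maps = " + maps);
        System.out.println("int  = " + memAsInt(maps, containerMb));
        System.out.println("long = " + memAsLong(maps, containerMb));

        // Threshold case from the report: 128 TB input, 128 MB blocks, 2 GB map containers.
        int thresholdMaps = (int) mapCount(128 * TB, 128 * MB);
        System.out.println("threshold int  = " + memAsInt(thresholdMaps, 2048));
        System.out.println("threshold long = " + memAsLong(thresholdMaps, 2048));
    }
}
```

Under these assumptions the 32 TB scenario yields 262144 map tasks, and 262144 x 10240 MB does not fit in an int, so the 32-bit product goes negative while the 64-bit product stays positive. In the 128 TB case the product is exactly 2^31 MB, one more than the largest int. Where silent wraparound is unacceptable, Math.multiplyExact throws ArithmeticException on overflow instead of wrapping.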



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-help@hadoop.apache.org

