hadoop-mapreduce-issues mailing list archives

From "YunFan Zhou (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-6944) MR job got hanged forever when some NMs unstable for some time
Date Mon, 21 Aug 2017 07:36:00 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-6944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

YunFan Zhou updated MAPREDUCE-6944:
-----------------------------------
    Description: 
In our production environment, several jobs hung because some NMs were unstable for a while:
one *MAP* task of each job got stuck, so the job could not finish.
However, the problem we encountered differs from the one described in [https://issues.apache.org/jira/browse/MAPREDUCE-6513],
because in our scenario none of the *MR REDUCEs* had started executing.
When I manually killed the hung *MAP*, the job finished normally.


{noformat}
2017-08-17 12:25:06,548 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
Reduce slow start threshold not met. completedMapsForReduceSlowstart 15564
2017-08-17 12:25:07,555 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
Received completed container container_e84_1502793246072_73922_01_015700
2017-08-17 12:25:07,556 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
Recalculating schedule, headroom=<memory:2218677, vCores:2225>
2017-08-17 12:25:07,556 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
Reduce slow start threshold not met. completedMapsForReduceSlowstart 15564
2017-08-17 12:25:07,556 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
After Scheduling: PendingReds:1009 ScheduledMaps:1 ScheduledReds:0 AssignedMaps:0 AssignedReds:0
CompletedMaps:15563 CompletedReds:0 ContAlloc:15723 ContRel:26 HostLocal:4575 RackLocal:8121
{noformat}
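
The repeated "Reduce slow start threshold not met" message can be read as a plain comparison between the number of completed maps and a threshold. Below is a minimal sketch of that comparison, assuming the threshold is the ceiling of a slow-start fraction multiplied by the total number of maps. The total of 15564 maps and the fraction of 1.0 are assumptions inferred from the logged threshold, not values stated in the log, and the function names are hypothetical rather than Hadoop APIs.

```python
import math


def slowstart_threshold(total_maps: int, slowstart_fraction: float) -> int:
    """Maps that must complete before reduces may be scheduled (assumed formula)."""
    return math.ceil(slowstart_fraction * total_maps)


def reduces_may_start(completed_maps: int, total_maps: int,
                      slowstart_fraction: float) -> bool:
    """True once the completed-map count reaches the slow-start threshold."""
    return completed_maps >= slowstart_threshold(total_maps, slowstart_fraction)


# Assumed values: neither the total map count nor the fraction appears in the log.
TOTAL_MAPS = 15564
SLOWSTART_FRACTION = 1.0

# CompletedMaps:15563 is taken from the "After Scheduling" line above.
print(reduces_may_start(15563, TOTAL_MAPS, SLOWSTART_FRACTION))
```

Under these assumptions, a single map that never completes keeps the comparison false indefinitely, which would match PendingReds staying at 1009 with ScheduledReds and AssignedReds at 0.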


{noformat}
2017-08-17 14:49:41,793 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
Before Scheduling: PendingReds:1009 ScheduledMaps:1 ScheduledReds:0 AssignedMaps:1 AssignedReds:0
CompletedMaps:15563 CompletedReds:0 ContAlloc:15724 ContRel:26 HostLocal:4575 RackLocal:8121
2017-08-17 14:49:41,794 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor:
Applying ask limit of 1 for priority:5 and capability:<memory:1024, vCores:1>
2017-08-17 14:49:41,799 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor:
getResources() for application_1502793246072_73922: ask=1 release= 0 newContainers=0 finishedContainers=0
resourcelimit=<memory:1711989, vCores:1688> knownNMs=4236
2017-08-17 14:49:41,799 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
Recalculating schedule, headroom=<memory:1711989, vCores:1688>
2017-08-17 14:49:41,799 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
Reduce slow start threshold not met. completedMapsForReduceSlowstart 15564
2017-08-17 14:49:42,805 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
Got allocated containers 1
2017-08-17 14:49:42,805 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
Assigning container Container: [ContainerId: container_e84_1502793246072_73922_01_015726,
NodeId: bigdata-hdp-apache1960.xg01.diditaxi.com:8041, NodeHttpAddress: bigdata-hdp-apache1960.xg01.diditaxi.com:8042,
Resource: <memory:1024, vCores:1>, Priority: 5, Token: Token { kind: ContainerToken,
service: 10.93.111.36:8041 }, ] to fast fail map
2017-08-17 14:49:42,805 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
Assigned from earlierFailedMaps
2017-08-17 14:49:42,805 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
Assigned container container_e84_1502793246072_73922_01_015726 to attempt_1502793246072_73922_m_012103_5
2017-08-17 14:49:42,805 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
Recalculating schedule, headroom=<memory:1727349, vCores:1703>
2017-08-17 14:49:42,805 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
Reduce slow start threshold not met. completedMapsForReduceSlowstart 15564
2017-08-17 14:49:42,805 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
After Scheduling: PendingReds:1009 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:2 AssignedReds:0
CompletedMaps:15563 CompletedReds:0 ContAlloc:15725 ContRel:26 HostLocal:4575 RackLocal:8121
{noformat}
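
To compare the two log excerpts, the counters on the "After Scheduling" lines can be parsed and diffed. The sketch below is a hypothetical helper for reading these logs, not part of Hadoop; the two input strings are the counter lines from 12:25:07 and 14:49:42 above, joined onto one line each.

```python
import re

# Matches "Name:123" pairs; letters only, so timestamps such as 12:25 are skipped.
COUNTER_RE = re.compile(r"([A-Za-z]+):(\d+)")


def parse_counters(line: str) -> dict:
    """Extract the allocator counters from a Before/After Scheduling line."""
    return {name: int(value) for name, value in COUNTER_RE.findall(line)}


def changed_counters(old: dict, new: dict) -> dict:
    """Return {name: (old_value, new_value)} for counters that differ."""
    return {k: (old[k], new[k]) for k in old if k in new and old[k] != new[k]}


EARLY = ("After Scheduling: PendingReds:1009 ScheduledMaps:1 ScheduledReds:0 "
         "AssignedMaps:0 AssignedReds:0 CompletedMaps:15563 CompletedReds:0 "
         "ContAlloc:15723 ContRel:26 HostLocal:4575 RackLocal:8121")
LATE = ("After Scheduling: PendingReds:1009 ScheduledMaps:0 ScheduledReds:0 "
        "AssignedMaps:2 AssignedReds:0 CompletedMaps:15563 CompletedReds:0 "
        "ContAlloc:15725 ContRel:26 HostLocal:4575 RackLocal:8121")

diff = changed_counters(parse_counters(EARLY), parse_counters(LATE))
for name in sorted(diff):
    print(name, diff[name])
```

Only the map scheduling and allocation counters differ between the two snapshots. CompletedMaps, CompletedReds and PendingReds are unchanged more than two hours later, which is the stall described in this report.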


{noformat}
!screenshot-1.png!
{noformat}




> MR job got hanged forever when some NMs unstable for some time
> --------------------------------------------------------------
>
>                 Key: MAPREDUCE-6944
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6944
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: YunFan Zhou
>            Priority: Critical
>         Attachments: screenshot-1.png
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-help@hadoop.apache.org

