hadoop-mapreduce-issues mailing list archives

From "Jason Lowe (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-4144) ResourceManager NPE while handling NODE_UPDATE
Date Thu, 12 Apr 2012 21:09:19 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-4144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4144:
----------------------------------

    Attachment: MAPREDUCE-4144-testcase.patch

I think the fix for MAPREDUCE-3005 is being skipped due to reserved containers.  Here's the
scenario:

# Application has some requests that are NODE_LOCAL and some others that are ANY.
# Node A heartbeats in and we try to schedule the NODE_LOCAL request on it, but there are no available containers and instead we make a reservation.
# Node B heartbeats in and it's on the same rack as Node A, so we fulfill the corresponding RACK_LOCAL request that went with Node A's NODE_LOCAL request.
# Node A heartbeats in with some spare containers, and we skip the MAPREDUCE-3005 fix in canAssign() because there is a reserved container on this node.  Since the RACK_LOCAL request was removed when we assigned it to Node B, we crash because we assume all NODE_LOCAL requests will have a corresponding RACK_LOCAL request.
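
To make the failure mode concrete, here is a minimal, self-contained Java sketch of the scenario above.  This is not the real AppSchedulingInfo code: the class names, the single flat map, and the resource-name keys ("nodeA", "rackA", "*") are assumptions made purely to illustrate how the node-local allocation path can dereference a rack-level request entry that was already removed when Node B was assigned.

{noformat}
import java.util.HashMap;
import java.util.Map;

// Minimal stand-in for the per-priority request table kept by the scheduler:
// outstanding requests keyed by resource name (host, rack, or "*").
public class ReservationNpeSketch {
    static class ResourceRequest {
        int numContainers;
        ResourceRequest(int n) { numContainers = n; }
    }

    static final Map<String, ResourceRequest> requests = new HashMap<>();

    // Rough model of the node-local allocation path: it decrements the
    // node-level request and then unconditionally decrements the matching
    // rack-level request, assuming that entry still exists.
    static void allocateNodeLocal(String host, String rack) {
        requests.get(host).numContainers--;   // node-level entry is still present
        requests.get(rack).numContainers--;   // NPE here: rack-level entry is gone
    }

    public static void main(String[] args) {
        // Step 1: one NODE_LOCAL ask plus the rack-level and ANY asks that go with it.
        requests.put("nodeA", new ResourceRequest(1));
        requests.put("rackA", new ResourceRequest(1));
        requests.put("*", new ResourceRequest(1));

        // Steps 2-3: Node A could only reserve; Node B (same rack) satisfied the
        // RACK_LOCAL request, which removed the rack-level entry.
        requests.remove("rackA");

        // Step 4: Node A comes back with free space and fulfills its reservation.
        // The MAPREDUCE-3005 guard in canAssign() is skipped because of the
        // reserved container, so the allocation path runs and crashes.
        allocateNodeLocal("nodeA", "rackA");  // throws NullPointerException
    }
}
{noformat}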

I checked the RM log above the crash, and I did find indications of container reservations
being in play.  For example:

{noformat}
 [ResourceManager Event Processor]2012-04-12 02:09:01,671 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Trying to fulfill reservation for application application_1334157153376_0281 on node: xxx:8041
 [ResourceManager Event Processor]2012-04-12 02:09:01,671 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApp: Application application_1334157153376_0281 unreserved on node host: xxx:8041 #containers=3 available=7680 used=13824, currently has 70 at priority org.apache.hadoop.yarn.api.records.impl.pb.PriorityPBImpl@33; currentReservation memory: 322560
{noformat}

Attached is a testcase that reproduces the NPE crash with the same backtrace.
                
> ResourceManager NPE while handling NODE_UPDATE
> ----------------------------------------------
>
>                 Key: MAPREDUCE-4144
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4144
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>            Reporter: Jason Lowe
>            Priority: Critical
>         Attachments: MAPREDUCE-4144-testcase.patch
>
>
> The RM on one of our clusters has exited twice in the past few days because of an NPE while trying to handle a NODE_UPDATE:
> {noformat}
> 2012-04-12 02:09:01,672 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler
>  [ResourceManager Event Processor]java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocateNodeLocal(AppSchedulingInfo.java:261)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:223)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApp.allocate(SchedulerApp.java:246)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainer(LeafQueue.java:1229)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignNodeLocalContainers(LeafQueue.java:1078)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainersOnNode(LeafQueue.java:1048)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignReservedContainer(LeafQueue.java:859)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:756)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:573)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:622)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:78)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:302)
>         at java.lang.Thread.run(Thread.java:619)
> {noformat}
> This is very similar to the failure reported in MAPREDUCE-3005.


        
