hadoop-yarn-issues mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-9552) FairScheduler: NODE_UPDATE can cause NoSuchElementException
Date Wed, 15 May 2019 19:06:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-9552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16840679#comment-16840679 ]

Hudson commented on YARN-9552:
------------------------------

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #16554 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/16554/])
YARN-9552. FairScheduler: NODE_UPDATE can cause NoSuchElementException. (gifuma: rev 55bd35921c2bb013e45120bbd1602b658b8b999b)
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFSAppAttempt.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java


> FairScheduler: NODE_UPDATE can cause NoSuchElementException
> -----------------------------------------------------------
>
>                 Key: YARN-9552
>                 URL: https://issues.apache.org/jira/browse/YARN-9552
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Major
>             Fix For: 3.3.0
>
>         Attachments: YARN-9552-001.patch, YARN-9552-002.patch, YARN-9552-003.patch, YARN-9552-004.patch
>
>
> We observed a race condition inside YARN with the following stack trace:
> {noformat}
> 18/11/07 06:45:09.559 SchedulerEventDispatcher:Event Processor ERROR EventDispatcher: Error in handling event type NODE_UPDATE to the Event Dispatcher
> java.util.NoSuchElementException
>         at java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:2036)
>         at java.util.concurrent.ConcurrentSkipListSet.first(ConcurrentSkipListSet.java:396)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.getNextPendingAsk(AppSchedulingInfo.java:373)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.isOverAMShareLimit(FSAppAttempt.java:941)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:1373)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:353)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:204)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1094)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:961)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1183)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:132)
>         at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
>         at java.lang.Thread.run(Thread.java:748)
> {noformat}
> This is essentially the same exception as the one described in YARN-7382, but the root cause is different.
> When we create an application attempt, we create an {{FSAppAttempt}} object. This contains an {{AppSchedulingInfo}}, which in turn contains a set of {{SchedulerRequestKey}} objects. Initially, this set is empty; it is populated only a bit later, on a separate thread, during a state transition:
> {noformat}
> 2019-05-07 15:58:02,659 INFO  [RM StateStore dispatcher] recovery.RMStateStore (RMStateStore.java:transition(239)) - Storing info for app: application_1557237478804_0001
> 2019-05-07 15:58:02,684 INFO  [RM Event dispatcher] rmapp.RMAppImpl (RMAppImpl.java:handle(903)) - application_1557237478804_0001 State change from NEW_SAVING to SUBMITTED on event = APP_NEW_SAVED
> 2019-05-07 15:58:02,690 INFO  [SchedulerEventDispatcher:Event Processor] fair.FairScheduler (FairScheduler.java:addApplication(490)) - Accepted application application_1557237478804_0001 from user: bacskop, in queue: root.bacskop, currently num of applications: 1
> 2019-05-07 15:58:02,698 INFO  [RM Event dispatcher] rmapp.RMAppImpl (RMAppImpl.java:handle(903)) - application_1557237478804_0001 State change from SUBMITTED to ACCEPTED on event = APP_ACCEPTED
> 2019-05-07 15:58:02,731 INFO  [RM Event dispatcher] resourcemanager.ApplicationMasterService (ApplicationMasterService.java:registerAppAttempt(434)) - Registering app attempt : appattempt_1557237478804_0001_000001
> 2019-05-07 15:58:02,732 INFO  [RM Event dispatcher] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(920)) - appattempt_1557237478804_0001_000001 State change from NEW to SUBMITTED on event = START
> 2019-05-07 15:58:02,746 INFO  [SchedulerEventDispatcher:Event Processor] scheduler.SchedulerApplicationAttempt (SchedulerApplicationAttempt.java:<init>(207)) - *** In the constructor of SchedulerApplicationAttempt
> 2019-05-07 15:58:02,747 INFO  [SchedulerEventDispatcher:Event Processor] scheduler.SchedulerApplicationAttempt (SchedulerApplicationAttempt.java:<init>(230)) - *** Contents of appSchedulingInfo: []
> 2019-05-07 15:58:02,752 INFO  [SchedulerEventDispatcher:Event Processor] fair.FairScheduler (FairScheduler.java:addApplicationAttempt(546)) - Added Application Attempt appattempt_1557237478804_0001_000001 to scheduler from user: bacskop
> 2019-05-07 15:58:02,756 INFO  [RM Event dispatcher] scheduler.AppSchedulingInfo (AppSchedulingInfo.java:updatePendingResources(257)) - *** Adding scheduler key: SchedulerRequestKey{priority=0, allocationRequestId=-1, containerToUpdate=null} for attempt: appattempt_1557237478804_0001_000001
> 2019-05-07 15:58:02,759 INFO  [RM Event dispatcher] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(920)) - appattempt_1557237478804_0001_000001 State change from SUBMITTED to SCHEDULED on event = ATTEMPT_ADDED
> 2019-05-07 15:58:02,892 INFO  [main] impl.YarnClientImpl (YarnClientImpl.java:submitApplication(310)) - Submitted application application_1557237478804_0001
> {noformat}
> (The lines marked with *** are extra log statements added for debugging.)
> So at 15:58:02,747 the set is empty, and it is populated with a single element only at 15:58:02,756, on the "RM Event dispatcher" thread. This leaves a short time window (about 9 ms in this log) during which a {{NODE_UPDATE}} can cause a {{NoSuchElementException}}.
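
The failure mode described above can be reproduced in isolation with a minimal sketch. It is not the code from the attached patches: the class and method names are hypothetical, and an {{Integer}} stands in for {{SchedulerRequestKey}}. It only shows that {{ConcurrentSkipListSet.first()}} throws on an empty set, and one possible way for a caller to tolerate that state.

```java
import java.util.NoSuchElementException;
import java.util.concurrent.ConcurrentSkipListSet;

// Hypothetical sketch; NOT the code from the YARN-9552 patch.
// Integer is an assumed stand-in for SchedulerRequestKey.
class EmptyKeySetSketch {

    // Mirrors the failing call in the stack trace:
    // first() throws NoSuchElementException when the set is empty.
    static Integer firstUnguarded(ConcurrentSkipListSet<Integer> keys) {
        return keys.first();
    }

    // Tolerates an empty set by returning null. Catching the exception
    // avoids a separate isEmpty() check, which could itself race with a
    // concurrent removal between the check and the call to first().
    static Integer firstOrNull(ConcurrentSkipListSet<Integer> keys) {
        try {
            return keys.first();
        } catch (NoSuchElementException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        ConcurrentSkipListSet<Integer> keys = new ConcurrentSkipListSet<>();
        // State during the window: the attempt exists, but no key was added yet.
        System.out.println("before first key: " + firstOrNull(keys));
        // State after the other thread has added the first key.
        keys.add(0);
        System.out.println("after first key: " + firstOrNull(keys));
    }
}
```

A caller using the guarded variant would have to treat a null result as "no pending request yet" and skip the attempt for that scheduling pass; whether that matches the committed fix is not stated in this message.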



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

