hadoop-yarn-issues mailing list archives

From "Omkar Vinit Joshi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-245) Node Manager can not handle duplicate responses
Date Fri, 19 Jul 2013 02:40:48 GMT

    [ https://issues.apache.org/jira/browse/YARN-245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13713280#comment-13713280 ]

Omkar Vinit Joshi commented on YARN-245:
----------------------------------------

Thanks [~mayank_bansal] for the patch. I agree that checking heartbeat IDs will test this
issue. A few comments:
{code}
+      conf.setBoolean(YarnConfiguration.LOG_AGGREGATION_ENABLED, true);
{code}

Why are we enabling log aggregation here?

{code}
+      NodeStatus nodeStatus = request.getNodeStatus();
+      nodeStatus.setResponseId(heartBeatID++);
{code}
Is this required? Can it be removed?

* There is one issue at present with NodeStatusUpdaterImpl.java. If we get such a duplicate
heartbeat response, we do not wait but try again right away. Check the code in the finally {}
block, which won't get executed, so we keep pinging the RM until we get a response with the
correct response-id. Should we wait, or request again immediately? Thoughts?
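
To make the "wait" option concrete, here is a minimal sketch. It is not the actual NodeStatusUpdaterImpl code: the class `HeartbeatLoopSketch`, the `Tracker` interface and the `run` method are hypothetical names, and the RM is reduced to something that returns a response id. The only point is that the wait sits on the common path of the loop, so a duplicate response is ignored but still followed by the normal heartbeat interval.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

/** Hypothetical sketch only; not the real NodeStatusUpdaterImpl. */
public class HeartbeatLoopSketch {

  /** Stand-in for the RM: takes the node's last known response id, returns a response id. */
  public interface Tracker {
    int nodeHeartbeat(int lastKnownResponseId);
  }

  /**
   * Sends heartbeats until `accepted` in-order responses have arrived.
   * Returns how many times the loop waited, so the behaviour can be checked.
   */
  public static int run(Tracker tracker, int accepted, long intervalMs)
      throws InterruptedException {
    int lastResponseId = 0;
    int ok = 0;
    int waits = 0;
    while (ok < accepted) {
      int responseId = tracker.nodeHeartbeat(lastResponseId);
      if (responseId == lastResponseId + 1) {
        // Expected response: advance the id.
        lastResponseId = responseId;
        ok++;
      }
      // Otherwise it is a duplicate or stale response: ignore it and keep the old id.

      // Wait on every path, including the duplicate one, instead of retrying at once.
      Thread.sleep(intervalMs);
      waits++;
    }
    return waits;
  }

  public static void main(String[] args) throws InterruptedException {
    // Canned responses: the second one repeats id 1 to simulate a duplicate.
    Deque<Integer> canned = new ArrayDeque<>(Arrays.asList(1, 1, 2));
    int waits = run(last -> canned.poll(), 2, 10L);
    System.out.println("waits = " + waits);
  }
}
```

With the canned responses above, the duplicate costs one extra interval rather than an immediate extra request to the RM.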

{code}
+        Thread.sleep(1000l);
{code}
Can we make it 1000 (without the lowercase `l` suffix)?

* The test will need a timeout. However, I see there are certain tests without one. If you add
a timeout, use a somewhat larger value. :)
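
For illustration only, this is roughly what a test timeout buys us, written in plain Java so it runs on its own. In the test itself the JUnit 4 `@Test(timeout = ...)` attribute would be the usual way to do it. `TimeoutGuardSketch` and `finishesWithin` are hypothetical names, and the values in the usage note are arbitrary.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch of a timeout guard around a test body. */
public class TimeoutGuardSketch {

  /** Returns true if body finished within timeoutMs, false if it was still running. */
  public static <T> boolean finishesWithin(Callable<T> body, long timeoutMs)
      throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    try {
      Future<T> result = pool.submit(body);
      result.get(timeoutMs, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException e) {
      return false;
    } finally {
      // Interrupt the body if it is still running so the worker thread can exit.
      pool.shutdownNow();
    }
  }
}
```

A body that sleeps for 1000 ms passes comfortably with a 10000 ms limit but can fail on a loaded build machine with a 1100 ms limit, which is why a generous value is the safer choice.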

{code}
+      if (nodeStatus.getKeepAliveApplications() != null
+          && nodeStatus.getKeepAliveApplications().size() > 0) {
+        for (ApplicationId appId : nodeStatus.getKeepAliveApplications()) {
+          List<Long> list = keepAliveRequests.get(appId);
+          if (list == null) {
+            list = new LinkedList<Long>();
+            keepAliveRequests.put(appId, list);
+          }
+          list.add(System.currentTimeMillis());
+        }
+      }
{code}
{code}
+      if (heartBeatID == 2) {
+        LOG.info("Sending FINISH_APP for application: [" + appId + "]");
+        this.context.getApplications().put(appId, mock(Application.class));
+        nhResponse.addAllApplicationsToCleanup(Collections.singletonList(appId));
+      }
{code}

{code}
+      rt.context.getApplications().remove(rt.appId);
{code} 

{code}
+    private Map<ApplicationId, List<Long>> keepAliveRequests =
+        new HashMap<ApplicationId, List<Long>>();
+    private ApplicationId appId = BuilderUtils.newApplicationId(1, 1);
{code}

Do we need this? Can we remove all the application-related code? Since we are now checking
only heartbeat IDs, it is no longer needed. Thoughts?
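
Something along these lines is what I have in mind: a tracker stub that only deals with response ids and replays one of them. This is a rough sketch, not the patch's test tracker. `DuplicateResponseTrackerSketch`, its fields and `getSeenFromNode` are hypothetical names, and the real request/response types are replaced by plain ints.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: tracker stub that tracks heartbeat ids only. */
public class DuplicateResponseTrackerSketch {
  private int calls = 0;
  private int lastSentId = 0;
  private final List<Integer> seenFromNode = new ArrayList<>();

  /**
   * Records the id reported by the node and returns a response id.
   * On the second call it replays the previous response id to simulate a duplicate.
   */
  public int nodeHeartbeat(int nodeResponseId) {
    calls++;
    seenFromNode.add(nodeResponseId);
    if (calls == 2) {
      return lastSentId; // duplicate response
    }
    lastSentId = nodeResponseId + 1;
    return lastSentId;
  }

  /** Ids the node reported, in order, for the test to assert on. */
  public List<Integer> getSeenFromNode() {
    return seenFromNode;
  }
}
```

The test can then assert on the reported ids alone. If the node ignores the duplicate, it reports the same id on the heartbeat that follows it, and no keep-alive map or mock Application is needed.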
                
> Node Manager can not handle duplicate responses
> -----------------------------------------------
>
>                 Key: YARN-245
>                 URL: https://issues.apache.org/jira/browse/YARN-245
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>    Affects Versions: 2.0.2-alpha, 2.0.1-alpha
>            Reporter: Devaraj K
>            Assignee: Mayank Bansal
>         Attachments: YARN-245-trunk-1.patch, YARN-245-trunk-2.patch, YARN-245-trunk-3.patch
>
>
> {code:xml}
> 2012-11-25 12:56:11,795 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: FINISH_APPLICATION at FINISHED
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:301)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:398)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:58)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:520)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:512)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>         at java.lang.Thread.run(Thread.java:662)
> 2012-11-25 12:56:11,796 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1353818859056_0004 transitioned from FINISHED to null
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
