hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Bikas Saha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-417) Add a poller that allows the AM to receive notifications when it is assigned containers
Date Mon, 04 Mar 2013 23:05:14 GMT

    [ https://issues.apache.org/jira/browse/YARN-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13592777#comment-13592777
] 

Bikas Saha commented on YARN-417:
---------------------------------

Calling client.registerApp() before client.start() and client.stop() before client.unregister()
is not in line with the Service interface. Services need to be started before used and stopped
after using them. Also, adding a number of services as part of a composite service is a common
pattern. In that, all services are added, inited, started, used and then stopped . The composite
service takes care of ordering between services. In such use cases, it may not possible to
call interface methods out of order as is being done here. We could enhance the heartbeater
to not heartbeat until register is called. or we could start the heartbeater after registration
is complete. The latter approach makes more sense to me.

I am surprised that the DistShell code is calling resourceManager.stop() and then resourceManager.unregister()
because stop() eventually call AMRMClientImpl.stop() that shuts down the proxy. After that,
unregister() call on AMRMClientImpl should fail.

Why are we calling client.start() in the init() method and not at the beginning of the start
method()? Perhaps related to the above comment.
{code}
+  @Override
+  public void init(Configuration conf) {
+    super.init(conf);
+    client.init(conf);
+    client.start();
+  }
{code}

Why not wait for the handlerThread to join()? The comment does not match the code for the
heartbeat thread.
{code}
+  /**
+   * Tells the heartbeat thread to stop, but does not wait for it to return.
+   */
+  @Override
+  public void stop() {
+    client.stop();
+    keepRunning = false;
+    try {
+      heartbeatThread.join();
+    } catch (InterruptedException ex) {
+      LOG.error("Error joining with heartbeat thread", ex);
+    }
+    handlerThread.interrupt();
+  }
{code}

In general, it would be good to spend some thought on the thread safety of the new class.
Both external calls from the app and the internal producer/consumer race between the heartbeat
and callback threads. During startup, execution and shutdown. I havent thought through them
but the almost complete absence of any synchronization made be wonder if it was by design.
I would prefer queue.put() which blocks on capacity instead of queue.add() to mirror queue.take().

Could save some time using wait/notify? Important for end to end tests time.
{done}
+    while (!done) {
+      try {
+        Thread.sleep(1000);
+      } catch (InterruptedException ex) {}
+    }
{done}

Looks like this is only for tests. If yes, how about making it package private and annotating
with @Private and @VisibleForTesting.
{code}
+  public AMRMClientAsync(AMRMClient client, int intervalMs,
+      CallbackHandler callbackHandler) {
{code}

A committer once told me that the philosophy behind BuilderUtils it to pass all members of
the object being built and use it as a completely defined constructor so that folks dont miss
passing any member fields by accident. So I guess nodeUpdates and reboot should also be passed
in as arguments.
{code}
+  public static AMResponse newAMResponse(
+      List<ContainerStatus> completedContainers,
+      List<Container> allocatedContainers) {
{code}

I would like the test code to not exemplify incorrect use of the class. The test is calling
allocate without call register and it all works. Maybe if we fixed the first comment in this
review then it wont allow such incorrect usage. Secondly, folks tend to look at test code
to see usage of a class and so showing incorrect usage is not a good idea IMO.
{code}
+    AMRMClientAsync asyncClient = new AMRMClientAsync(client, 200, callbackHandler);
+    asyncClient.init(conf);
+    asyncClient.start();
+    
+    while (callbackHandler.takeAllocatedContainers() == null) {
+
{code}

This code can lead to a flaky test. If I understand the flow correctly the following can happen.
CallbackHandler populates allocatedContainers and OS pauses it. In the meanwhile heartbeater
has already given completedContainers. The main thread then takesAllocatedContainers and it
pauses. The CallbackHandler then returns and onCompletedContainers() is called which populates
completed containers. Then it pauses. The main thread executes takeCompletedContainers() which
returns non-null and the Assert fails. Is this a correct understanding? If yes, we should
make sure that the test does not end up being flaky. In general sleep() should be avoided
because it makes tests slow and tend to be flaky. I agree in some case, sleep is hard to avoid
when the test is running an inline service whose timing we cannot control or when the effort
to do so is too large. But in this case where all the code is test code or mock code, we could
avoid sleeping.
{code}
+    while (callbackHandler.takeAllocatedContainers() == null) {
+      // allocated containers should come before completed containers
+      Assert.assertEquals(null, callbackHandler.takeCompletedContainers());
+      Thread.sleep(100);
+    }
{code}

An interesting test would be one that asserts that the heartbeats continue to happen when
the callback is called. Its an important feature we explicitly designed for, right?


                
> Add a poller that allows the AM to receive notifications when it is assigned containers
> ---------------------------------------------------------------------------------------
>
>                 Key: YARN-417
>                 URL: https://issues.apache.org/jira/browse/YARN-417
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: api, applications
>    Affects Versions: 2.0.3-alpha
>            Reporter: Sandy Ryza
>            Assignee: Sandy Ryza
>         Attachments: AMRMClientAsync-1.java, AMRMClientAsync.java, YARN-417-1.patch,
YARN-417-2.patch, YARN-417-3.patch, YARN-417.patch, YarnAppMaster.java, YarnAppMasterListener.java
>
>
> Writing AMs would be easier for some if they did not have to handle heartbeating to the
RM on their own.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message