aurora-issues mailing list archives

From "Zameer Manji (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AURORA-1769) Enabling webhook is synchronous and could cause longer leader reelection cycle
Date Sat, 10 Sep 2016 03:28:20 GMT

    [ https://issues.apache.org/jira/browse/AURORA-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15479046#comment-15479046 ]

Zameer Manji commented on AURORA-1769:
--------------------------------------

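[~maximk]: I don't think that's sufficient. Any blocking call in any event subscriber
will delay propagation of events. To reproduce this, apply the following patch to your repo.
It enables the webhook in the Vagrant scheduler config and replaces the webhook call with a
15-second sleep: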

{noformat}
diff --git c/examples/vagrant/upstart/aurora-scheduler.conf w/examples/vagrant/upstart/aurora-scheduler.conf
index 91b27d7..f7419d4 100644
--- c/examples/vagrant/upstart/aurora-scheduler.conf
+++ w/examples/vagrant/upstart/aurora-scheduler.conf
@@ -51,4 +51,5 @@ exec bin/aurora-scheduler \
   -mesos_role=aurora-role \
   -populate_discovery_info=true \
   -receive_revocable_resources=true \
-  -allow_gpu_resource=true
+  -allow_gpu_resource=true \
+  -webhook_config=/home/vagrant/aurora/src/main/resources/org/apache/aurora/scheduler/webhook.json
diff --git c/src/main/java/org/apache/aurora/scheduler/events/Webhook.java w/src/main/java/org/apache/aurora/scheduler/events/Webhook.java
index e54aa19..ed61ac0 100644
--- c/src/main/java/org/apache/aurora/scheduler/events/Webhook.java
+++ w/src/main/java/org/apache/aurora/scheduler/events/Webhook.java
@@ -13,6 +13,7 @@
  */
 package org.apache.aurora.scheduler.events;
 
+import com.google.common.base.Throwables;
 import java.io.DataOutputStream;
 import java.io.InputStream;
 import java.net.HttpURLConnection;
@@ -23,6 +24,8 @@ import com.google.common.eventbus.Subscribe;
 
 import com.google.inject.Inject;
 
+import org.apache.aurora.common.quantity.Amount;
+import org.apache.aurora.common.quantity.Time;
 import org.apache.aurora.scheduler.events.PubsubEvent.EventSubscriber;
 import org.apache.aurora.scheduler.events.PubsubEvent.TaskStateChange;
 import org.slf4j.Logger;
@@ -104,7 +107,11 @@ public class Webhook implements EventSubscriber {
    */
   @Subscribe
   public void taskChangedState(TaskStateChange stateChange) {
-    String eventJson = stateChange.toJson();
-    callEndpoint(eventJson);
+    int i = Amount.of(15, Time.SECONDS).as(Time.MILLISECONDS);
+    try {
+      Thread.sleep(i);
+    } catch (InterruptedException e) {
+      Throwables.propagate(e);
+    }
   }
 }

{noformat}

Then, in Vagrant, create a job with 100 tasks.

Then restart the scheduler. It will never register within one minute, because the async
worker for the event bus is blocked delivering {{TaskStateChange}} events. You can confirm
this by checking {{/threads}}, where the {{AsyncProcessor-*}} threads show as blocked in the
{{Webhook}} class.
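
The effect can also be illustrated outside Aurora. The following is a minimal sketch, not Aurora's event bus: it assumes, for simplicity, a single worker thread delivering events in order, and the class and method names ({{BlockingSubscriberDemo}}, {{delayOfLastEvent}}) are hypothetical. It queues several deliveries that block, followed by one fast event, and measures how long the fast event waits.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for an async event bus backed by one worker thread. */
final class BlockingSubscriberDemo {
  /**
   * Queues slowEvents deliveries that each block for blockMillis, then one
   * fast event, and returns how many milliseconds passed before the fast
   * event was actually delivered.
   */
  static long delayOfLastEvent(int slowEvents, long blockMillis) {
    // Assumption: one worker thread, so deliveries are strictly serialized.
    ExecutorService worker = Executors.newSingleThreadExecutor();
    try {
      long start = System.nanoTime();
      for (int i = 0; i < slowEvents; i++) {
        // Stands in for a subscriber that blocks, e.g. on a slow HTTP call.
        worker.submit(() -> {
          try {
            Thread.sleep(blockMillis);
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      // Stands in for an event that should be handled promptly.
      Callable<Long> fastEvent = System::nanoTime;
      Future<Long> deliveredAt = worker.submit(fastEvent);
      return TimeUnit.NANOSECONDS.toMillis(deliveredAt.get() - start);
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      worker.shutdownNow();
    }
  }
}
```

With this sketch, the fast event's delay grows with the number of blocked deliveries queued ahead of it, which is the same shape of problem as 100 {{TaskStateChange}} events each holding a worker for 15 seconds.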

Since calling an external HTTP server can block for an unknown amount of time, I think the
solution here is to make the hook async: have the event subscriber just place the event
in a queue for processing. The hook can then have its own thread pool for sending the
requests out.
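
A rough sketch of that shape, using only {{java.util.concurrent}}, might look like the following. It is not the actual Aurora change: the class name {{AsyncWebhookSketch}}, the {{Consumer<String>}} standing in for the HTTP call, the pool size, the queue capacity, and the drop-when-full policy are all assumptions made for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical async hook: the subscriber enqueues, a private pool sends. */
final class AsyncWebhookSketch {
  private final ThreadPoolExecutor senders;
  private final Consumer<String> endpoint; // stands in for the blocking HTTP call

  AsyncWebhookSketch(int threads, int queueCapacity, Consumer<String> endpoint) {
    this.endpoint = endpoint;
    // Fixed-size pool with a bounded queue; the default policy rejects when full.
    this.senders = new ThreadPoolExecutor(
        threads, threads, 0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(queueCapacity));
  }

  /**
   * Called on the event bus thread. Only hands the event to the pool, so it
   * returns without waiting for the endpoint. Returns false if the queue is
   * full and the event was dropped (an assumed policy, not a recommendation).
   */
  boolean taskChangedState(String eventJson) {
    try {
      senders.execute(() -> {
        try {
          endpoint.accept(eventJson);
        } catch (RuntimeException e) {
          // A real hook would log the failure; the sketch just swallows it.
        }
      });
      return true;
    } catch (RejectedExecutionException e) {
      return false;
    }
  }

  /** Stops accepting events and waits for queued sends to finish. */
  boolean shutdownAndWait(long timeoutMillis) {
    senders.shutdown();
    try {
      return senders.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

The bounded queue is a deliberate choice in this sketch: an unbounded one would keep the event bus unblocked but could grow without limit if the remote server stays slow, so some policy for a full queue has to be picked either way.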

> Enabling webhook is synchronous and could cause longer leader reelection cycle
> ------------------------------------------------------------------------------
>
>                 Key: AURORA-1769
>                 URL: https://issues.apache.org/jira/browse/AURORA-1769
>             Project: Aurora
>          Issue Type: Bug
>            Reporter: Dmitriy Shirchenko
>            Assignee: Dmitriy Shirchenko
>
> We had an issue where on scheduler leader reelection EventBus was full of TaskStateChange
> events and caused scheduler to not be able to post DriverRegistered() message which caused
> Aurora scheduler to not register within 1 minute.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
