aurora-issues mailing list archives

From "Ashwin Murthy (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AURORA-1493) create ELB-friendly endpoint to detect leading scheduler
Date Wed, 30 Mar 2016 23:21:25 GMT

    [ https://issues.apache.org/jira/browse/AURORA-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219051#comment-15219051 ]

Ashwin Murthy commented on AURORA-1493:
---------------------------------------

Hi Bill, 

Thanks for your help! My current thoughts:

1. Create a new HTTP endpoint called /IsLeader. Add this to LEADER_ENDPOINTS
   in JettyServerModule.java.

2. Create a corresponding servlet class (similar to Locks.java).

3. Implement a GET method where I can do something like:

Optional<HostAndPort> leaderHttp = getLeaderHttp();
Optional<HostAndPort> localHttp = getLocalHttp();

if (leaderHttp.isPresent() && leaderHttp.equals(localHttp)) {
  return LeaderStatus.LEADING;
}

4. This is similar to what is in LeaderRedirect::getRedirectStatus.

5. If leader, return 200, else 503. 
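
A minimal sketch of steps 3 and 5, under stated assumptions: plain "host:port"
strings stand in for HostAndPort so the example has no dependencies, and the
class name LeaderHealthCheck, the method names, and the NOT_LEADING and
NO_LEADER values are illustrative only. This is not the actual Aurora code.

```java
import java.util.Optional;

// Hypothetical helper; names are illustrative, not taken from the Aurora code base.
final class LeaderHealthCheck {
  enum LeaderStatus { LEADING, NOT_LEADING, NO_LEADER }

  private LeaderHealthCheck() {}

  // Compares the advertised leader address with this scheduler's own address.
  static LeaderStatus getStatus(Optional<String> leaderHttp, Optional<String> localHttp) {
    if (!leaderHttp.isPresent()) {
      return LeaderStatus.NO_LEADER;
    }
    return leaderHttp.equals(localHttp) ? LeaderStatus.LEADING : LeaderStatus.NOT_LEADING;
  }

  // Step 5: 200 only when this instance is leading, 503 in every other case.
  static int toHttpStatus(LeaderStatus status) {
    return status == LeaderStatus.LEADING ? 200 : 503;
  }
}
```

Treating "no leader elected" the same as "not leading" is a choice made for
this sketch; a load balancer should not route to any instance in either case.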

What do you think?

Bill Farner
Mar 24 (6 days ago)

to me 
From the ELB docs on health checks, I see this:

For HTTP/HTTPS, you must include a ping path in the string. HTTP is specified as a HTTP:port;/;PathToPing;
grouping, for example "HTTP:80/weather/us/wa/seattle". In this case, a HTTP GET request is
issued to the instance on the given port and path. Any answer other than "200 OK" within the
timeout period is considered unhealthy.

This makes me wonder if the desired behavior is already there with any endpoints running through
LeaderRedirectFilter (which includes LEADER_ENDPOINTS).  Those endpoints only return 200 if
the instance is leading.

Can you double-check if we're already in good shape?


Ashwin Murthy <ashwinmurthy@gmail.com>
Mar 25 (5 days ago)

to Bill 
OK. I will test this in our env by setting up the health checks to go against one of the
LEADER_ENDPOINTS and seeing if this works.

Thanks Bill!


Ashwin Murthy <ashwinmurthy@gmail.com>
Mar 25 (5 days ago)

to Bill 
Hi Bill, 

So this is what I see, and I think it aligns with my understanding of how things might
work. When you issue any HTTP request to any of the non-leading schedulers, a 307 temporary
redirect is sent. The Location header contains the leader's host:port and the path of the
original request. The HTTP client/browser will reconnect. This is what I see happen in our
prod Aurora env. After I issue this in the browser, I see the browser load the redirected
page from the leader.

====================
Request URL:http://<non-leader-hostname>:8082/slaves
Request Method:GET
Status Code:307 Temporary Redirect
Remote Address:127.0.0.1:8127
Response Headers
Content-Length:0
Date:Fri, 25 Mar 2016 23:50:48 GMT
Location:http://10.162.9.54:8082/slaves
Server:Jetty(9.3.6.v20151106)
Request Headers
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip, deflate, sdch
Accept-Language:en-US,en;q=0.8
Connection:keep-alive
Host:<non-leader-hostname>:8082
Referer: <non-leader-hostname>:8082/
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like
Gecko) Chrome/49.0.2623.87 Safari/537.36

I think we might need to add a new endpoint (say /isLeader) which is not part of the redirect
filter and then return 200 or 500 accordingly. What do you think?


Bill Farner
Mar 25 (5 days ago)

to me 
That behavior matches my understanding, but according to the ELB docs, that should work for a
health check (200=healthy, non-200=unhealthy).  Did you find otherwise in ELB behavior, or
contradicting docs?


Ashwin Murthy <ashwinmurthy@gmail.com>
Mar 25 (5 days ago)

to Bill 
Ah, I haven't tested the non-200 health check per se on ELB. But I can test this out in our
prod env, which uses HAProxy for health checks. Our load balancer team did say a 500-level error
code. But let me confirm.

But thinking about this more: even if ELB treats a 307 as an unhealthy health check, this seems
like a hack to me. It is possible that other load balancers in fact honor the redirect.
I used to work in Azure networking before Uber, and I know their L7 LB was planning support
for handling redirects.

From an HTTP perspective, it might be better to send a 500-level code.
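
To make the concern concrete, here is a small sketch comparing two health-check
policies. Both are assumptions for illustration: strictHealthy models the
"anything other than 200 is unhealthy" rule quoted from the ELB docs, and
redirectFollowingHealthy models a hypothetical checker that follows a redirect
and judges the response it finds at the target. No real load balancer's API is
used here.

```java
// Hypothetical policies for illustration; not tied to any load balancer's API.
final class HealthCheckPolicies {
  private HealthCheckPolicies() {}

  // Strict rule: only an exact 200 counts as healthy.
  static boolean strictHealthy(int status) {
    return status == 200;
  }

  // A checker that follows 3xx responses judges the redirect target instead.
  static boolean redirectFollowingHealthy(int status, int statusAtRedirectTarget) {
    boolean isRedirect = status >= 300 && status < 400;
    return strictHealthy(isRedirect ? statusAtRedirectTarget : status);
  }
}
```

Under the strict rule a non-leader's 307 is unhealthy, but a redirect-following
checker would see the leader's 200 and mark the non-leader healthy. A 503 is
unhealthy under both policies, which is why a 500-level response is the safer
signal.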


Bill Farner
Mar 25 (5 days ago)

to me 
Looks like HAProxy allows you to specify the expected status code.

An HTTP load balancer following redirects sounds pretty bizarre, but I've seen stranger things
:-)

At any rate, I'm cool with a /leaderhealth endpoint that is 200/503 based on leading status.
 Let me know if that's what you want to do, and if you need any pointers to get going.
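
As a rough illustration of the 200/503 behavior, the sketch below serves
/leaderhealth with the JDK's built-in com.sun.net.httpserver.HttpServer rather
than the Jetty servlet setup the scheduler uses. The class name
LeaderHealthEndpoint, the start and probe helpers, and the isLeading supplier
are all assumptions made for this example.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.util.function.BooleanSupplier;

// Hypothetical stand-in for the real servlet; uses the JDK HTTP server, not Jetty.
final class LeaderHealthEndpoint {
  private LeaderHealthEndpoint() {}

  // Starts a server on an ephemeral local port that answers /leaderhealth.
  static HttpServer start(BooleanSupplier isLeading) {
    try {
      HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
      server.createContext("/leaderhealth", exchange -> {
        int status = isLeading.getAsBoolean() ? 200 : 503;
        exchange.sendResponseHeaders(status, -1); // -1 means no response body
        exchange.close();
      });
      server.start();
      return server;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Issues one GET without following redirects and returns the status code.
  static int probe(HttpServer server) {
    try {
      int port = server.getAddress().getPort();
      HttpURLConnection conn = (HttpURLConnection)
          URI.create("http://127.0.0.1:" + port + "/leaderhealth").toURL().openConnection();
      conn.setInstanceFollowRedirects(false);
      conn.setRequestMethod("GET");
      try {
        return conn.getResponseCode();
      } finally {
        conn.disconnect();
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Calling LeaderHealthEndpoint.start(() -> true) and then probe(server) should
yield 200, and a supplier returning false should yield 503. Call
server.stop(0) when done so the server thread does not keep the JVM alive.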


Ashwin Murthy <ashwinmurthy@gmail.com>
Mar 28 (2 days ago)

to Bill 
Hi Bill, 

I will go ahead and add this. Does my proposed set of changes earlier in this thread sound
about right?


Bill Farner
Mar 29 (1 day ago)

to me 
Yup, that's how you should approach it.

> create ELB-friendly endpoint to detect leading scheduler
> --------------------------------------------------------
>
>                 Key: AURORA-1493
>                 URL: https://issues.apache.org/jira/browse/AURORA-1493
>             Project: Aurora
>          Issue Type: Task
>          Components: Scheduler, Usability
>            Reporter: brian wickman
>            Assignee: Ashwin Murthy
>
> iiuc hitting the web ui for non-leading schedulers redirects to the leader.  this doesn't
> really help if the members of the ensemble are not publicly routable.
> if there was a /leader endpoint that returned "200 OK" if it is leader and some 3xx/4xx
> code if not, then it would be easier to configure an ELB to route traffic to the correct leader,
> simplifying the use of aurora in an AWS deployment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
