aurora-dev mailing list archives

From ASF IRC Bot <>
Subject Summary of IRC Meeting in #aurora
Date Mon, 13 Jul 2015 18:51:36 GMT
Summary of IRC Meeting in #aurora at Mon Jul 13 18:01:44 2015:

Attendees: thalin, dnorris, wickman, jcohen, wfarner, Yasumoto, rafik, rdelvalle, bbrazil,
zmanji, dlester

- Preface
- 0.9.0 release update
- in-memory H2 store progress
- react.js experiment for scheduler UI
- Ubiquitous services (AURORA-1075)
- Resource allocation for cron jobs

IRC log follows:

## Preface ##
[Mon Jul 13 18:01:57 2015] <wfarner>: let's start with a roll call
[Mon Jul 13 18:01:59 2015] <wfarner>: here
[Mon Jul 13 18:02:08 2015] <jcohen>: afternoon!
[Mon Jul 13 18:02:09 2015] <dnorris>: here
[Mon Jul 13 18:02:15 2015] <rafik>: here
[Mon Jul 13 18:02:20 2015] <wickman>: ahoy
[Mon Jul 13 18:02:56 2015] <zmanji>: here
[Mon Jul 13 18:03:04 2015] <dlester>: here
[Mon Jul 13 18:03:34 2015] <Yasumoto>: howdy
[Mon Jul 13 18:03:55 2015] <thalin>: here
[Mon Jul 13 18:05:07 2015] <rdelvalle>: here
## 0.9.0 release update ##
[Mon Jul 13 18:05:40 2015] <wfarner>: AURORA-1078
[Mon Jul 13 18:06:13 2015] <wfarner>: we are in roughly the same state as last week
w.r.t. readiness to cut a release candidate
[Mon Jul 13 18:06:45 2015] <wfarner>: kts added AURORA-1352 as a blocker, and IIRC is
in-flight on fixing, but is currently on vacation
[Mon Jul 13 18:07:32 2015] <wfarner>: i assume the fix will be out for review tomorrow,
completing all tickets to cut a release candidate
## in-memory H2 store progress ##
[Mon Jul 13 18:09:57 2015] <wfarner>: for those who haven't been paying close attention,
we/i have been on an effort to migrate the scheduler's in-memory storage to use H2 and SQL
[Mon Jul 13 18:10:29 2015] <rafik>: What was the storage previously?
[Mon Jul 13 18:10:44 2015] <jcohen>: it currently uses the mesos replicated log
[Mon Jul 13 18:10:55 2015] <wfarner>: the in-memory layer has been hand-rolled, mostly
hash maps
[Mon Jul 13 18:11:02 2015] <rafik>: K
[Mon Jul 13 18:11:18 2015] <jcohen>: oh, derp, memory storage ;)
[Mon Jul 13 18:11:48 2015] <wfarner>: there's more background in this thread from last
[Mon Jul 13 18:12:46 2015] <wfarner>: most of the functional changes are now complete,
and i've done work to bring the performance up to suitable levels
[Mon Jul 13 18:13:11 2015] <wfarner>: the last bit of work is working out some concurrency
kinks, e.g. AURORA-1395
[Mon Jul 13 18:13:42 2015] <bbrazil>: wfarner: it uses the -hostname for web redirects,
and the local IP for cli tools
[Mon Jul 13 18:14:01 2015] <wfarner>: note that these issues don't apply to those using
default scheduler command line settings - the most critical stores will only be switched to
this system with a command line arg
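[Editor's note] For readers unfamiliar with the migration being described: H2's in-memory mode provides SQL semantics over heap-resident tables, replacing the hand-rolled hash maps mentioned above. The idea is analogous to SQLite's `:memory:` databases, sketched here with Python's stdlib `sqlite3` purely for illustration (table name and columns are invented, not Aurora's actual schema):

```python
import sqlite3

# Open a purely in-memory database; nothing touches disk, which is
# the role H2's in-memory mode plays inside the scheduler.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (task_id TEXT PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO tasks VALUES (?, ?)", ("task-1", "PENDING"))
conn.execute("UPDATE tasks SET status = ? WHERE task_id = ?",
             ("RUNNING", "task-1"))

# A SQL query replaces a hand-rolled hash-map lookup.
status = conn.execute(
    "SELECT status FROM tasks WHERE task_id = ?", ("task-1",)
).fetchone()[0]
print(status)  # RUNNING
```

The equivalent H2 connection in the scheduler's Java code would use a `jdbc:h2:mem:` URL; the trade-off discussed here is richer querying in exchange for the concurrency tuning mentioned in AURORA-1395.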
## react.js experiment for scheduler UI ##
[Mon Jul 13 18:15:08 2015] <wfarner>: jcohen: the floor is yours
[Mon Jul 13 18:15:58 2015] <jcohen>: I’ve been doing some planning work on the future
of the Scheduler UI. There’s a lot that can be done to clean things up, both from a tech
debt perspective (no tests) and from a usability perspective.
[Mon Jul 13 18:16:19 2015] <jcohen>: the first step in that process has been to evaluate
alternatives to Angular.
[Mon Jul 13 18:16:36 2015] <jcohen>: As such I’ve been putting together a very simple
proof of concept demo using React
[Mon Jul 13 18:17:01 2015] <jcohen>: if anyone has any experience w/ React and Angular
and would like to discuss the pros/cons, feel free to let me know.
[Mon Jul 13 18:17:09 2015] <jcohen>: I can start a thread on dev@
[Mon Jul 13 18:17:25 2015] <rafik>: I'd be very interested in helping out with a React.js
[Mon Jul 13 18:17:27 2015] <jcohen>: Otherwise, I hope to push the demo some time this
[Mon Jul 13 18:17:44 2015] <rafik>: I've found the UI to be rather limited and non-performant
so far
[Mon Jul 13 18:18:02 2015] <jcohen>: I don’t have a ton of React experience, so it’s
a bit of a learning curve, but overall I’m liking what I’m seeing so far.
[Mon Jul 13 18:18:30 2015] <jcohen>: The main concern is whether a full rewrite is really
warranted to solve the underlying tech debt issues, or whether time is better spent simply
improving the existing Angular app.
[Mon Jul 13 18:19:21 2015] <jcohen>: I’ll send something out to dev@ when the demo
is available, and we can have that discussion then.
[Mon Jul 13 18:20:02 2015] <jcohen>: ACTION <eom>
[Mon Jul 13 18:20:26 2015] <wfarner>: rafik: awesome!  most committers are systems developers,
so we've been historically scant for web dev chops.  if you can help fill a void there it
would be awesome!
[Mon Jul 13 18:20:43 2015] <jcohen>: +1
[Mon Jul 13 18:20:58 2015] <rafik>: One question regarding the UI: Is it a requirement
that the UI currently be served by the Aurora scheduler?
[Mon Jul 13 18:21:11 2015] <jcohen>: That’s another topic I’ve broached internally
[Mon Jul 13 18:21:20 2015] <rafik>: Or are you open to having the UI run a separate
server using the Thrift (or future HTTP) api?
[Mon Jul 13 18:21:30 2015] <jcohen>: There’s certainly a benefit to doing so, as it’s
one less component to deploy.
[Mon Jul 13 18:21:53 2015] <jcohen>: But being able to use, e.g., node.js to serve up
the UI as an isomorphic app also has its benefits.
[Mon Jul 13 18:22:17 2015] <jcohen>: In any event, I don’t think there are any sacred
cows as far as the UI is concerned.
[Mon Jul 13 18:22:22 2015] <bbrazil>: the last time it came up the general consensus
was that separate was fine for prototyping, but most people wanted it as part of the scheduler
[Mon Jul 13 18:22:28 2015] <wfarner>: rafik: it's definitely not a requirement.  even
today, that slice of the scheduler is really just asset serving + API calls, so it is technically
[Mon Jul 13 18:22:48 2015] <rafik>: Maybe an in-between where you package the UI as
a directory of assets but allow the directory to be overridden
[Mon Jul 13 18:22:53 2015] <jcohen>: one alternative might be to allow optionally serving
up a separate UI from the default scheduler UI (or a command line flag to entirely suppress
the scheduler serving a UI at all)
[Mon Jul 13 18:22:55 2015] <wfarner>: but +1 to fewer parts for default deployment
[Mon Jul 13 18:23:05 2015] <rafik>: Consul supports a `-ui-dir` argument to update/deploy
the UI portions separately
[Mon Jul 13 18:23:34 2015] <rafik>: And I've found that to be quite useful in testing
out changes
[Mon Jul 13 18:24:01 2015] <bbrazil>: we do similarly for prometheus itself for development
[Mon Jul 13 18:24:02 2015] <rafik>: See "Self-hosted Dashboard":
[Mon Jul 13 18:24:13 2015] <jcohen>: seems reasonable
[Mon Jul 13 18:24:42 2015] <wfarner>: ah, yes - i would like that as well.  i did some
work a while back to get us close to that - all assets now live in a single path on the classpath
[Mon Jul 13 18:25:02 2015] <zmanji>: I also would like that as well. It would allow
for cluster operators to make zone specific ui changes if required
[Mon Jul 13 18:25:17 2015] <wfarner>: the only kink is that they don't live update,
but that shouldn't be too hard to resolve
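[Editor's note] The `-ui-dir`-style override discussed above amounts to pointing a static file server at an operator-configurable directory. A minimal stdlib sketch (the flag name and everything else here is hypothetical, by analogy with Consul's `-ui-dir`, not an existing Aurora option):

```python
import functools
import http.server

def make_ui_server(ui_dir: str, port: int = 0) -> http.server.HTTPServer:
    """Build a static file server rooted at ui_dir (port 0 lets the OS pick).
    An operator could point this at a local checkout to test UI changes
    without rebuilding the scheduler, as rafik describes for Consul."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=ui_dir)
    return http.server.HTTPServer(("127.0.0.1", port), handler)

# Hypothetical operator override:
#   scheduler --ui-dir=/srv/custom-ui
server = make_ui_server(".")
# server.serve_forever()  # blocking; left commented in this sketch
```

Note wfarner's point that the scheduler's assets already live under a single classpath path, so the real change is mostly wiring such a directory override (plus live-reloading) into the existing asset-serving slice.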
[Mon Jul 13 18:26:34 2015] <wfarner>: sounds like there's continued interest here. 
i suggest that jcohen and rafik continue offline, and jcohen starts a dev@ thread shortly
[Mon Jul 13 18:26:48 2015] <jcohen>: sounds good to me
[Mon Jul 13 18:26:59 2015] <rafik>: Same
[Mon Jul 13 18:27:24 2015] <wfarner>: last call for topics, i'll otherwise close in
3 mins
## Ubiquitous services (AURORA-1075) ##
[Mon Jul 13 18:28:26 2015] <wfarner>: rafik: floor is yours
[Mon Jul 13 18:29:10 2015] <rafik>: I'm wondering if there's been any update on the
proposal that Anindya has opened
[Mon Jul 13 18:29:29 2015] <jcohen>: AURORA-1075
[Mon Jul 13 18:29:31 2015] <rafik>: Link to proposal here:
[Mon Jul 13 18:30:10 2015] <rafik>: We're interested in supporting this for our installation
[Mon Jul 13 18:30:26 2015] <rafik>: But don't want to invest too much time in exploring
it if there's independent momentum
[Mon Jul 13 18:30:53 2015] <rafik>: Also not sure that we have resources that could
contribute it yet, especially as it seems to require quite a few changes across existing pieces
[Mon Jul 13 18:31:06 2015] <wfarner>: i have not heard of anyone writing code related
to that proposal
[Mon Jul 13 18:31:33 2015] <zmanji>: the proposal also has some (unresolved?) issues
in the comments
[Mon Jul 13 18:32:10 2015] <wfarner>: i will take a pass through the document today
to see if i can help drive any of those to resolution.
[Mon Jul 13 18:32:24 2015] <rafik>: Thanks wfarner, that would be great
[Mon Jul 13 18:32:41 2015] <rafik>: <eom>
[Mon Jul 13 18:32:46 2015] <rafik>: One more topic:
## Resource allocation for cron jobs ##
[Mon Jul 13 18:33:11 2015] <rafik>: This came up on Thursday evening, but we didn't
seem to get a complete answer
[Mon Jul 13 18:33:23 2015] <rafik>: What's the story re: reserving quota for cron jobs?
[Mon Jul 13 18:33:47 2015] <rafik>: We have quite a few cron jobs that run very infrequently
and are currently reserving useful quota
[Mon Jul 13 18:34:03 2015] <wickman>: i'm with rafik: why?  imho i think quota should
be deducted at runtime for all tasks, including those spawned by crons.
[Mon Jul 13 18:34:04 2015] <rafik>: Is there any interest in rethinking the quota reservation
for cron jobs?
[Mon Jul 13 18:34:12 2015] <zmanji>: It used to not be this case
[Mon Jul 13 18:34:15 2015] <zmanji>: but maxim reverted it
[Mon Jul 13 18:34:21 2015] <wickman>: bring back PENDING: insufficient quota
[Mon Jul 13 18:34:34 2015] <wfarner>: that was never a thing
[Mon Jul 13 18:34:37 2015] <jcohen>: Cron jobs used to run regardless of quota, right?
[Mon Jul 13 18:34:50 2015] <wfarner>: quota checks have always been at job submission
[Mon Jul 13 18:34:54 2015] <zmanji>: jcohen: yes and that was a hole because then a
role could consume more production resources than its quota
[Mon Jul 13 18:34:59 2015] <jcohen>: right
[Mon Jul 13 18:35:01 2015] <rafik>: Right, but cron scheduling isn't job submission
[Mon Jul 13 18:35:12 2015] <rafik>: My understanding is that crons are scheduled and
then the Aurora scheduler submits jobs for them later
[Mon Jul 13 18:35:23 2015] <rafik>: Shouldn't the quota check happen then? as if it
were an ad-hoc job?
[Mon Jul 13 18:35:48 2015] <wfarner>: rafik: i would support that
[Mon Jul 13 18:36:15 2015] <wfarner>: though we don't really have a means to give feedback
about that right now
[Mon Jul 13 18:36:16 2015] <rafik>: It sounds like that used to be existing behavior?
[Mon Jul 13 18:36:23 2015] <rafik>: Is it just a matter of reverting a change?
[Mon Jul 13 18:36:26 2015] <wfarner>: nope, that behavior never existed
[Mon Jul 13 18:36:48 2015] <rafik>: Re: feedback. You mean give feedback to operators
that their jobs aren't running?
[Mon Jul 13 18:37:15 2015] <wfarner>: correct, we currently lack a "pending job" concept,
only pending tasks
[Mon Jul 13 18:37:45 2015] <rafik>: Is that true? I see "Pending: Insufficient CPU",
etc. for jobs all the time in the UI
[Mon Jul 13 18:37:53 2015] <bbrazil>: I'd argue that sort of feedback should be from
your monitoring system in the first instance, though the UI should expose something about
[Mon Jul 13 18:37:54 2015] <wfarner>: right - that's for tasks
[Mon Jul 13 18:38:26 2015] <rafik>: wfarner: not sure I understand the difference
[Mon Jul 13 18:38:30 2015] <wfarner>: say we were to launch cron jobs and hold its tasks
in PENDING due to insufficient quota.  what does that mean for service tasks when they restart?
[Mon Jul 13 18:38:46 2015] <wfarner>: do service tasks also wait because a cron iteration
is holding up resources?
[Mon Jul 13 18:39:13 2015] <rafik>: I would prefer they do, yes
[Mon Jul 13 18:39:19 2015] <bbrazil>: I'd expect quota to be at a job level, so tasks
restarting/updating wouldn't be affected
[Mon Jul 13 18:39:20 2015] <rafik>: Assuming the crons are marked `production`, etc.
[Mon Jul 13 18:39:38 2015] <bbrazil>: a new job may not be accepted in that case, or
updates changing resource usage
[Mon Jul 13 18:39:45 2015] <wickman>: wfarner: FIFO queue
[Mon Jul 13 18:39:48 2015] <wickman>: wfarner: of pending tasks
[Mon Jul 13 18:39:57 2015] <rafik>: This really depends on how people are using cron,
but in our particular use case, we have cron jobs that are related to long-running services
[Mon Jul 13 18:40:11 2015] <rafik>: It's been brought up before, but some concept of
job "groups" may actually go towards resolving this
[Mon Jul 13 18:40:23 2015] <rafik>: E.g. you can say my offline payments service has
these 5 cron jobs that need to run
[Mon Jul 13 18:40:41 2015] <rafik>: In that case, Aurora could do something like reserve
the maximum of the cron job resources
[Mon Jul 13 18:40:49 2015] <rafik>: And only allow one job to run at a time for instance
[Mon Jul 13 18:40:59 2015] <rafik>: Obviously not applicable in all circumstances
[Mon Jul 13 18:41:12 2015] <bbrazil>: I could see that for a dependency setup, not sure
about the more general case you're proposing
[Mon Jul 13 18:41:14 2015] <wfarner>: yeah, and could make things even harder to reason
[Mon Jul 13 18:41:18 2015] <rafik>: But some concept of pooling together crons so that
they would share resources might be useful
[Mon Jul 13 18:41:47 2015] <wfarner>: rafik: i definitely agree that as a user i should
be able to deliberately stagger my cron jobs to time-share quota
[Mon Jul 13 18:41:49 2015] <rafik>: Okay, perhaps better to table the job group discussion
for now then
[Mon Jul 13 18:42:10 2015] <rafik>: Yeah, for background ~60% of our quota is reserved
by cron jobs right now
[Mon Jul 13 18:42:18 2015] <rafik>: Most of which only run on the order of once a day,
or once a week
[Mon Jul 13 18:42:47 2015] <rafik>: Aurora could make some attempt to reserve cron based
on the job schedules
[Mon Jul 13 18:43:04 2015] <rafik>: I.e. recognize crons as being non-overlapping
[Mon Jul 13 18:43:13 2015] <rafik>: But that assumes some knowledge of their run-time,
I suppose
[Mon Jul 13 18:43:21 2015] <wfarner>: right
[Mon Jul 13 18:43:38 2015] <bbrazil>: and gets more complicated if something else is
using the resources it wants
[Mon Jul 13 18:43:46 2015] <rafik>: Right
[Mon Jul 13 18:43:53 2015] <wfarner>: IMHO a pending job submission is the easiest to
think about from an operator and user perspective
[Mon Jul 13 18:44:04 2015] <bbrazil>: +1
[Mon Jul 13 18:44:05 2015] <wfarner>: (not to be confused with a pending task)
[Mon Jul 13 18:44:06 2015] <rafik>: +1
[Mon Jul 13 18:44:26 2015] <wickman>: really, why pending job?
[Mon Jul 13 18:44:46 2015] <wickman>: and not just PENDING: insufficient quota + a FIFO
[Mon Jul 13 18:45:05 2015] <wickman>: are you worried about reduced availability of
flapping service tasks?
[Mon Jul 13 18:45:10 2015] <wickman>: that's what priority is for
[Mon Jul 13 18:45:23 2015] <wfarner>: wickman: good point w.r.t. priority
[Mon Jul 13 18:46:04 2015] <bbrazil>: I think that job admission control should be separate
from task scheduling and restart handling
[Mon Jul 13 18:46:12 2015] <wfarner>: checking the code, the scheduler does appropriately
use priority within a role
[Mon Jul 13 18:47:46 2015] <wfarner>: another behavior supporting bbrazil's statement
- the current approach is immune to a cluster administrator fat-fingering a user's quota
[Mon Jul 13 18:48:24 2015] <wfarner>: if quota is considered during task scheduling,
there's a larger potential impact
[Mon Jul 13 18:48:47 2015] <wfarner>: though this could also be argued for checking
quota while kicking off a cron run
[Mon Jul 13 18:48:48 2015] <bbrazil>: it'd also require providing more quota to be safe
to handle task scheduling
[Mon Jul 13 18:49:24 2015] <wickman>: i'm of the opposite belief -- we should go even
further and evaluate quota for running tasks every time quota changes
[Mon Jul 13 18:49:33 2015] <wickman>: in other words, reducing quota can actually preempt
tasks and make them go PENDING: insufficient quota
[Mon Jul 13 18:49:51 2015] <bbrazil>: you want the quota given to directly be what you
want the user to be able to use, if the administrator has to add safety/fudge factors that
makes resource less manageable
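[Editor's note] The behavior rafik and wickman argue for — deducting quota when a cron run is actually triggered, instead of holding a permanent reservation at job submission — can be illustrated with a toy accountant (class and method names are invented for illustration; this is not Aurora's quota implementation):

```python
class QuotaAccountant:
    """Toy model: quota is deducted only while tasks are running, so an
    infrequent cron job holds no reservation between its runs."""

    def __init__(self, role_quota_cpus: float):
        self.quota = role_quota_cpus
        self.in_use = 0.0

    def try_launch(self, cpus: float) -> bool:
        # Check performed at cron-trigger time (treating the run like an
        # ad-hoc job), rather than once at job submission.
        if self.in_use + cpus > self.quota:
            return False  # would surface as a pending job, or as
                          # wickman's "PENDING: insufficient quota" + FIFO
        self.in_use += cpus
        return True

    def finish(self, cpus: float) -> None:
        self.in_use -= cpus

acct = QuotaAccountant(role_quota_cpus=4.0)
assert acct.try_launch(3.0)      # a service's tasks are running
assert not acct.try_launch(2.0)  # cron trigger denied: would exceed quota
acct.finish(3.0)
assert acct.try_launch(2.0)      # the next trigger fits once quota frees up
```

The sketch also makes the debated failure mode concrete: with runtime deduction, a restarting service task can be blocked by a concurrent cron run holding the freed quota, which is why bbrazil argues for keeping job admission control separate from task scheduling.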
[Mon Jul 13 18:51:00 2015] <wfarner>: we're running long on the meeting.  rafik: i suggest
you carry this discussion to dev@ so that we may continue offline
[Mon Jul 13 18:51:12 2015] <rafik>: Sure
[Mon Jul 13 18:51:26 2015] <wfarner>: closing up now, thanks for the interesting discussions,
[Mon Jul 13 18:51:29 2015] <wfarner>: ASFBot: meeting stop

Meeting ended at Mon Jul 13 18:51:29 2015
