aurora-dev mailing list archives

From ASF IRC Bot <asf...@urd.zones.apache.org>
Subject Summary of IRC Meeting in #aurora
Date Mon, 13 Oct 2014 18:45:43 GMT
Summary of IRC Meeting in #aurora at Mon Oct 13 18:02:25 2014:

Attendees: wickman, jcohen, wfarner, Yasumoto, kts, mkhutornenko, davelester, zmanji

- Preface
- Aurora doc day
- 0.6.0 release
- Test coverage flakiness
- External update coordination
- Ticket resolution field
- Health check snooze
  - Action: wfarner to report back to email thread with discussion
- Retiring the GC executor
- Security


IRC log follows:

## Preface ##
[Mon Oct 13 18:02:54 2014] <wfarner>: welcome, folks.  let's kick off with a roll call
[Mon Oct 13 18:02:55 2014] <wfarner>: here
[Mon Oct 13 18:02:56 2014] <mkhutornenko>: here
[Mon Oct 13 18:03:07 2014] <kts>: here
[Mon Oct 13 18:03:08 2014] <jcohen>: here
[Mon Oct 13 18:03:11 2014] <zmanji>: here
[Mon Oct 13 18:03:35 2014] <davelester>: present
## Aurora doc day ##
[Mon Oct 13 18:05:36 2014] <wfarner>: kts, davelester the floor is yours
[Mon Oct 13 18:05:43 2014] <Yasumoto>: woot
[Mon Oct 13 18:05:43 2014] <kts>: thanks wfarner
[Mon Oct 13 18:06:02 2014] <kts>: we're organizing a day to focus on improving aurora's
documentation
[Mon Oct 13 18:06:47 2014] <kts>: it's this Thursday, 16 Oct 2014 from 1000-1700 PDT
[Mon Oct 13 18:07:18 2014] <wickman>: here (womp)
[Mon Oct 13 18:07:32 2014] <kts>: we'll be coordinating in this channel, but if you
have anything you'd like to see documentation improved for please file a ticket now
[Mon Oct 13 18:07:52 2014] <kts>: some great examples of what those tickets look like:
[Mon Oct 13 18:07:54 2014] <kts>: AURORA-829
[Mon Oct 13 18:08:12 2014] <kts>: AURORA-828
[Mon Oct 13 18:08:27 2014] <kts>: make sure you mark your ticket with the JIRA component
"Documentation"
[Mon Oct 13 18:08:35 2014] <kts>: hope to see everyone there
[Mon Oct 13 18:08:50 2014] <davelester>: We currently have 19 unresolved issues w/ the
Documentation component https://issues.apache.org/jira/issues/?jql=project%20%3D%20AURORA%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Documentation%20ORDER%20BY%20priority%20DESC
## 0.6.0 release ##
[Mon Oct 13 18:09:51 2014] <wfarner>: AURORA-711
[Mon Oct 13 18:10:28 2014] <wfarner>: In the course of adding the new client syntax
and backend to the release goals, we've picked up a number of blocking tickets
[Mon Oct 13 18:11:03 2014] <wfarner>: I implore everyone with some spare cycles to pick
up one of these tickets to help us cross the finish line.
[Mon Oct 13 18:11:49 2014] <wfarner>: Please explicitly take ownership by self-assigning
what you think you can pick up.
## Test coverage flakiness ##
[Mon Oct 13 18:13:40 2014] <wfarner>: ~2 weeks ago, i added a check to the java build
to fail the build on different types of missing test coverage
[Mon Oct 13 18:14:16 2014] <wfarner>: There has been some difficult-to-pinpoint flakiness
with one check in particular - which makes sure that all classes have some test coverage
[Mon Oct 13 18:14:32 2014] <wfarner>: I believe this is now fixed, with AURORA-822
[Mon Oct 13 18:14:37 2014] <wfarner>: AURORA-822
[Mon Oct 13 18:14:58 2014] <wfarner>: If you see any more issues, please raise a ticket,
as i now consider the bug squashed.
## External update coordination ##
[Mon Oct 13 18:15:59 2014] <wfarner>: mkhutornenko: anything to follow up from the email
discussion on this topic?
[Mon Oct 13 18:16:37 2014] <mkhutornenko>: I would really like to hear more feedback
on that
[Mon Oct 13 18:16:39 2014] <wfarner>: context: http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201410.mbox/%3CCAOTkfX7x2oipk4ZFysoS0uWZRizOnKJA3y15pvEW5K4YnUHw-A%40mail.gmail.com%3E
[Mon Oct 13 18:17:08 2014] <mkhutornenko>: is it going to add any value, any changes
we should consider, etc.
[Mon Oct 13 18:17:23 2014] <wfarner>: ok - everyone please read that thread, speak now
or forever hold your peace
## Ticket resolution field ##
[Mon Oct 13 18:18:51 2014] <wfarner>: There's a gotcha with closing tickets right now
that i'm working to resolve.
[Mon Oct 13 18:19:04 2014] <wfarner>: This results in tickets being in status=Closed
with resolution=None.
[Mon Oct 13 18:19:21 2014] <mkhutornenko>: +1 caught me many times before
[Mon Oct 13 18:19:57 2014] <wfarner>: I believe this is an issue with the JIRA project
configuration.  For the time being, please be careful to avoid clicking buttons like 'Close'.
[Mon Oct 13 18:20:05 2014] <wfarner>: Instead prefer buttons that say 'Resolve'.
[Mon Oct 13 18:20:33 2014] <wfarner>: I hope to have this resolved this week, but as
i do not have JIRA admin access, i cannot guarantee this.
## Health check snooze ##
[Mon Oct 13 18:21:52 2014] <wfarner>: We had a review for a new feature move to a dev
list discussion last week.  Does anybody believe we did not achieve consensus on the approach?
[Mon Oct 13 18:22:00 2014] <wfarner>: https://reviews.apache.org/r/26383/
[Mon Oct 13 18:22:21 2014] <wfarner>: AURORA-795
[Mon Oct 13 18:22:27 2014] <zmanji>: There is a mailing list thread here: http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201410.mbox/%3CCACGrrVnLWDU=vEVAFt_QN0iL5C8OQ7pqae-3Ge5NNH6vJg4uGg@mail.gmail.com%3E
[Mon Oct 13 18:22:57 2014] <zmanji>: I don’t think we have a consensus yet so please
voice your opinion
[Mon Oct 13 18:23:00 2014] <wickman>: I think the consensus was "touch a snooze file,
then unlink after mtime + CONSTANT_TIMEOUT"
[Mon Oct 13 18:23:07 2014] <wickman>: is that not correct?
[Mon Oct 13 18:23:14 2014] <wfarner>: wickman: that was my understanding as well
[Mon Oct 13 18:23:42 2014] <wickman>: the other option is "touch a file, and the health
checker is disabled as long as that file is there."
[Mon Oct 13 18:23:43 2014] <kts>: I still feel that we should avoid being too clever
in our implementation here
[Mon Oct 13 18:23:44 2014] <jcohen>: yeah, it sounded to me like that’s what we were
coalescing on.
[Mon Oct 13 18:23:58 2014] <wickman>: the reason that I'm less in favor of that approach
is that it's not really a snooze -- it's a sleep, and could be prone to somebody forgetting
to turn it off
[Mon Oct 13 18:24:10 2014] <wickman>: which might be okay -- i think 99 times out of
100, people will be snoozing so they can get the state of a wedged task
[Mon Oct 13 18:24:13 2014] <wickman>: at which point they will kill when they're done
[Mon Oct 13 18:24:18 2014] <wickman>: so i think there's a reasonable argument either
way
[Mon Oct 13 18:24:24 2014] <wfarner>: yes, i'm torn
[Mon Oct 13 18:24:36 2014] <kts>: but we don't really know how long a tool will take
to get information about the wedged state
[Mon Oct 13 18:24:47 2014] <wickman>: kts: yeah, that's why #1 might be more appealing
[Mon Oct 13 18:24:58 2014] <jcohen>: kts: in that case they could extend the snooze
by using touch -m?
[Mon Oct 13 18:25:05 2014] <wickman>: though you could just do (while true; do touch
.snooze; sleep 60; done;) &
[Mon Oct 13 18:25:46 2014] <jcohen>: I suppose it’s a question of what’s more likely
(or more concerning): will someone forget to remove a snooze, or forget to extend it
[Mon Oct 13 18:26:06 2014] <mkhutornenko>: +1 for not deleting the file. Avoiding FS
mutation == Less complexity == less things to go wrong
[Mon Oct 13 18:26:09 2014] <wickman>: i think it's important to look at why you'd want
to snooze in the first place
[Mon Oct 13 18:26:15 2014] <kts>: forget to extend means diagnostic information is lost
forever
[Mon Oct 13 18:26:18 2014] <wickman>: the only case i can think of is something in a
super weird state
[Mon Oct 13 18:26:27 2014] <wickman>: and they're almost always going to kill those
things in the weird state when they're done
[Mon Oct 13 18:26:32 2014] <wickman>: which would point to a permanent snooze
[Mon Oct 13 18:26:38 2014] <wfarner>: that was my feeling as well
[Mon Oct 13 18:28:17 2014] <wfarner>: should we reverse the position on this back to
no time awareness at all?
[Mon Oct 13 18:28:23 2014] <kts>: +1
[Mon Oct 13 18:28:31 2014] <mkhutornenko>: +1
[Mon Oct 13 18:28:40 2014] <zmanji>: +1
[Mon Oct 13 18:28:46 2014] <wfarner>: +1
[Mon Oct 13 18:29:02 2014] <zmanji>: wfarner: can you update that thread and review
with this information?
[Mon Oct 13 18:29:10 2014] <jcohen>: wickman is +1 on permanent snooze by proxy
[Mon Oct 13 18:29:11 2014] <wickman>: aye
[Mon Oct 13 18:29:22 2014] <wickman>: permanent snooze #freebandname
[Mon Oct 13 18:29:24 2014] <wfarner>: #action wfarner to report back to email thread
with discussion
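
The two snooze behaviors debated above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Aurora's health checker: the function names are hypothetical, the `.snooze` file name is borrowed from wickman's shell loop, and the 600-second value standing in for CONSTANT_TIMEOUT is invented for the example.

```python
import os
import time

SNOOZE_FILE = ".snooze"      # assumption: name taken from the shell loop in the log
SNOOZE_TIMEOUT_SECS = 600    # assumption: placeholder for CONSTANT_TIMEOUT


def snoozed_permanent(sandbox):
    """Permanent snooze: health checks are skipped for as long as the file exists.

    Nothing on the filesystem is mutated; the operator removes the file
    (or kills the task) when finished.
    """
    return os.path.exists(os.path.join(sandbox, SNOOZE_FILE))


def snoozed_with_timeout(sandbox, now=None, timeout=SNOOZE_TIMEOUT_SECS):
    """Timed snooze: the snooze lapses once mtime + timeout has passed.

    When the snooze has expired the file is unlinked, so the operator must
    touch it again to extend the snooze.
    """
    path = os.path.join(sandbox, SNOOZE_FILE)
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return False  # no snooze file present
    if now >= mtime + timeout:
        try:
            os.unlink(path)
        except OSError:
            pass  # file already removed by someone else
        return False
    return True
```

The permanent variant never deletes anything, which matches the "avoiding FS mutation" argument; the timed variant risks losing diagnostic state if the operator forgets to touch the file again before the timeout.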
## Retiring the GC executor ##
[Mon Oct 13 18:29:43 2014] <wfarner>: AURORA-715
[Mon Oct 13 18:30:22 2014] <wfarner>: jcohen you are leading the charge here.  i believe
there may be more tickets to create under that epic
[Mon Oct 13 18:30:27 2014] <wfarner>: do you feel you have a grasp for what is involved?
[Mon Oct 13 18:30:39 2014] <jcohen>: wickman and I began discussing this a bit today.
I still need to do a bit of research before I fully understand everything that needs to be
done.
[Mon Oct 13 18:30:48 2014] <wfarner>: great
[Mon Oct 13 18:31:08 2014] <wfarner>: can you (very) briefly summarize the moving parts
for those not in the know?
[Mon Oct 13 18:31:30 2014] <jcohen>: Not yet ;)
[Mon Oct 13 18:31:51 2014] <wfarner>: fair enough, please fill in the epic as you uncover
more.  we can revisit next week
[Mon Oct 13 18:32:08 2014] <wickman>: the tl;dr here is that only thermos_observer and
thermos will need a plugin to detect tasks via either the ExecutorDetector (path-based code)
or new code that talks to the local slave
[Mon Oct 13 18:32:20 2014] <jcohen>: Feel free to correct my possibly naive understanding,
but the GC executor is currently responsible for reconciling task state and cleaning up thermos
checkpoints
[Mon Oct 13 18:32:36 2014] <jcohen>: the task state reconciliation will be handled by
mesos
[Mon Oct 13 18:33:40 2014] <jcohen>: so we’ll need to fix things so checkpoints are
properly cleaned up w/o the GC executor as well as work out a way for the scheduler UI to
be properly notified
[Mon Oct 13 18:33:45 2014] <kts>: as will the cleaning of checkpoints as they'll be
moved into the sandbox
[Mon Oct 13 18:33:45 2014] <jcohen>: (hand wave)
[Mon Oct 13 18:34:04 2014] <kts>: and therefore within the purview of the slave's disk
gc
[Mon Oct 13 18:34:25 2014] <jcohen>: yes
[Mon Oct 13 18:34:33 2014] <wickman>: yeah, once the checkpoint root is set to be within
the mesos sandbox, we no longer need to be concerned about clean up anymore... just discoverability
via the thermos CLI and thermos observer
[Mon Oct 13 18:34:44 2014] <wickman>: the longer term plan for the thermos observer
is to deprecate it -- so if that's accelerated, that issue is moot
[Mon Oct 13 18:35:22 2014] <wfarner>: thanks for the context
[Mon Oct 13 18:35:30 2014] <wfarner>: That exhausts my topics, any others?
## Security ##
[Mon Oct 13 18:36:30 2014] <kts>: AURORA-720
[Mon Oct 13 18:37:25 2014] <kts>: I've written up a rough outline of proposed steps
to refactor the scheduler security code to use apache shiro
[Mon Oct 13 18:37:39 2014] <kts>: expressed as issues in that epic
[Mon Oct 13 18:38:21 2014] <kts>: tl;dr we would adopt Shiro and deprecate our custom
solution
[Mon Oct 13 18:38:49 2014] <kts>: of which there are currently no public implementations
that do anything
[Mon Oct 13 18:39:07 2014] <mkhutornenko>: kts: add that outline to AURORA-723?
[Mon Oct 13 18:39:14 2014] <mkhutornenko>: AURORA-723
[Mon Oct 13 18:39:22 2014] <wickman>: there are no public applications/products that
use Shiro?
[Mon Oct 13 18:39:39 2014] <kts>: wickman: no, that use our custom security framework
in aurora
[Mon Oct 13 18:39:39 2014] <wickman>: out of curiosity, how old is Shiro?
[Mon Oct 13 18:39:43 2014] <wickman>: oh
[Mon Oct 13 18:39:43 2014] <wickman>: ok
[Mon Oct 13 18:40:15 2014] <wfarner>: http://en.wikipedia.org/wiki/Apache_Shiro
[Mon Oct 13 18:40:21 2014] <wfarner>: 1.0 since July 2010
[Mon Oct 13 18:40:29 2014] <wfarner>: latest release Feb 2014
[Mon Oct 13 18:40:42 2014] <kts>: and a fellow ASF project
[Mon Oct 13 18:41:00 2014] <kts>: anyway more details to come
[Mon Oct 13 18:41:13 2014] <wfarner>: thanks, kts
[Mon Oct 13 18:41:17 2014] <wfarner>: Last call for topics
[Mon Oct 13 18:41:52 2014] <kts>: AURORA-801
[Mon Oct 13 18:42:28 2014] <kts>: worth noting - if you've been running off master recently
you'll want to pick up the patch that fixes that issue
[Mon Oct 13 18:43:18 2014] <kts>: that's all I've got
[Mon Oct 13 18:45:17 2014] <wfarner>: ASFBot702: meeting stop


Meeting ended at Mon Oct 13 18:45:17 2014
