aurora-reviews mailing list archives

From Santhosh Kumar Shanmugham <santhoshkuma...@gmail.com>
Subject Re: Review Request 51874: Change framework_name default value from 'TwitterScheduler' to 'aurora'
Date Wed, 14 Sep 2016 20:39:15 GMT


> On Sept. 13, 2016, 5:11 p.m., Zameer Manji wrote:
> > I support this change as a developer.
> > 
> > As an operator I am scared.
> > 
> > What happens to an existing cluster if we don't set `framework_name`? Will it register
another framework_id (bad), or will it fail to register (better)?
> 
> Santhosh Kumar Shanmugham wrote:
>     The restarting framework will be treated like a scheduler fail-over.
> 
> Zameer Manji wrote:
>     The release notes in this patch say
>     > Update default value of command line option `-framework_name` to 'aurora'. Please
be aware that
>       depending on your usage of Mesos, this will be a backward incompatible change.
>       
>     I'm trying to understand the implications of the backwards incompatibility. Will
the scheduler fail to register, or will it register under a new framework id (and then lose
track of previous tasks)?
> 
> Joshua Cohen wrote:
>     Santhosh, did you verify this in vagrant with a scheduler that already had tasks
running? If it is backwards compatible then we can probably adjust the release notes?
> 
> Santhosh Kumar Shanmugham wrote:
>     Results from testing in a Vagrant cluster:
>     
>     Renaming framework from 'TwitterScheduler' to 'Aurora':
>     
>     The framework re-registers after restart (treated by master as failover) and gets
the same framework-id and performs task reconciliation thereby restoring the tasks.
>     
>     I0914 16:48:28.408182  9815 master.cpp:1297] Giving framework 071c44a1-b4d4-4339-a727-03a79f725851-0000
(TwitterScheduler) at scheduler-75517c8f-5913-49e9-8cc4-342a78c9bbcb@192.168.33.7:8083 3weeks
to failover
>     I0914 16:48:28.408226  9815 hierarchical.cpp:382] Deactivated framework 071c44a1-b4d4-4339-a727-03a79f725851-0000
>     E0914 16:48:28.408617  9819 process.cpp:2105] Failed to shutdown socket with fd 28:
Transport endpoint is not connected
>     I0914 16:48:43.722126  9813 master.cpp:2424] Received SUBSCRIBE call for framework
'Aurora' at scheduler-dfad8309-de4b-47d8-a8f8-82828ea40a12@192.168.33.7:8083
>     I0914 16:48:43.722190  9813 master.cpp:2500] Subscribing framework Aurora with checkpointing
enabled and capabilities [ REVOCABLE_RESOURCES, GPU_RESOURCES ]
>     I0914 16:48:43.722225  9813 master.cpp:2564] Updating info for framework 071c44a1-b4d4-4339-a727-03a79f725851-0000
>     I0914 16:48:43.722256  9813 master.cpp:2577] Framework 071c44a1-b4d4-4339-a727-03a79f725851-0000
(Aurora) at scheduler-75517c8f-5913-49e9-8cc4-342a78c9bbcb@192.168.33.7:8083 failed over
>     I0914 16:48:43.722429  9813 hierarchical.cpp:348] Activated framework 071c44a1-b4d4-4339-a727-03a79f725851-0000
>     I0914 16:48:43.722595  9813 master.cpp:5709] Sending 1 offers to framework 071c44a1-b4d4-4339-a727-03a79f725851-0000
(Aurora) at scheduler-dfad8309-de4b-47d8-a8f8-82828ea40a12@192.168.33.7:8083
>     I0914 16:49:44.204677  9812 master.cpp:5447] Performing explicit task state reconciliation
for 1 tasks of framework 071c44a1-b4d4-4339-a727-03a79f725851-0000 (Aurora) at scheduler-dfad8309-de4b-47d8-a8f8-82828ea40a12@192.168.33.7:8083
>     
>     Rolling back framework name to 'TwitterScheduler' from 'Aurora':
>     
>     Same result as above.
>     
>     I0914 16:51:33.203495  9812 master.cpp:1297] Giving framework 071c44a1-b4d4-4339-a727-03a79f725851-0000
(Aurora) at scheduler-dfad8309-de4b-47d8-a8f8-82828ea40a12@192.168.33.7:8083 3weeks to failover
>     I0914 16:51:33.203526  9812 hierarchical.cpp:382] Deactivated framework 071c44a1-b4d4-4339-a727-03a79f725851-0000
>     I0914 16:51:49.614074  9813 master.cpp:2424] Received SUBSCRIBE call for framework
'TwitterScheduler' at scheduler-6fa8b819-aed9-42e1-9c6c-3e4be2f62500@192.168.33.7:8083
>     I0914 16:51:49.614215  9813 master.cpp:2500] Subscribing framework TwitterScheduler
with checkpointing enabled and capabilities [ REVOCABLE_RESOURCES, GPU_RESOURCES ]
>     I0914 16:51:49.614312  9813 master.cpp:2564] Updating info for framework 071c44a1-b4d4-4339-a727-03a79f725851-0000
>     I0914 16:51:49.614359  9813 master.cpp:2577] Framework 071c44a1-b4d4-4339-a727-03a79f725851-0000
(TwitterScheduler) at scheduler-dfad8309-de4b-47d8-a8f8-82828ea40a12@192.168.33.7:8083 failed
over
>     I0914 16:51:49.614977  9813 hierarchical.cpp:348] Activated framework 071c44a1-b4d4-4339-a727-03a79f725851-0000
>     I0914 16:51:49.615170  9813 master.cpp:5709] Sending 1 offers to framework 071c44a1-b4d4-4339-a727-03a79f725851-0000
(TwitterScheduler) at scheduler-6fa8b819-aed9-42e1-9c6c-3e4be2f62500@192.168.33.7:8083
>     I0914 16:52:50.315119  9812 master.cpp:5447] Performing explicit task state reconciliation
for 1 tasks of framework 071c44a1-b4d4-4339-a727-03a79f725851-0000 (TwitterScheduler) at scheduler-6fa8b819-aed9-42e1-9c6c-3e4be2f62500@192.168.33.7:8083
>     
>     Restarting the scheduler after updating the config to 'TwitterScheduler' from 'Aurora':
>     
>     Rename did not take effect. The master re-registered the framework to the same id
and performed a task reconciliation.
>     
>     I0914 20:11:49.178103 28171 master.cpp:1297] Giving framework 071c44a1-b4d4-4339-a727-03a79f725851-0000
(Aurora) at scheduler-c42cd8cf-09a0-4d81-a947-094c4fac601e@192.168.33.7:8083 3weeks to failover
>     I0914 20:11:49.178138 28171 hierarchical.cpp:382] Deactivated framework 071c44a1-b4d4-4339-a727-03a79f725851-0000
>     E0914 20:11:49.183275 28178 process.cpp:2105] Failed to shutdown socket with fd 29:
Transport endpoint is not connected
>     I0914 20:12:33.277560 28177 master.cpp:2424] Received SUBSCRIBE call for framework
'Aurora' at scheduler-6dcb9baa-503f-44a9-9df6-79da717f3a1c@192.168.33.7:8083
>     I0914 20:12:33.277710 28177 master.cpp:2500] Subscribing framework Aurora with checkpointing
enabled and capabilities [ REVOCABLE_RESOURCES, GPU_RESOURCES ]
>     I0914 20:12:33.277753 28177 master.cpp:2564] Updating info for framework 071c44a1-b4d4-4339-a727-03a79f725851-0000
>     I0914 20:12:33.277784 28177 master.cpp:2577] Framework 071c44a1-b4d4-4339-a727-03a79f725851-0000
(Aurora) at scheduler-c42cd8cf-09a0-4d81-a947-094c4fac601e@192.168.33.7:8083 failed over
>     I0914 20:12:33.277961 28177 hierarchical.cpp:348] Activated framework 071c44a1-b4d4-4339-a727-03a79f725851-0000
>     I0914 20:12:33.278136 28177 master.cpp:5709] Sending 1 offers to framework 071c44a1-b4d4-4339-a727-03a79f725851-0000
(Aurora) at scheduler-6dcb9baa-503f-44a9-9df6-79da717f3a1c@192.168.33.7:8083
>     I0914 20:13:33.848175 28175 master.cpp:5447] Performing explicit task state reconciliation
for 1 tasks of framework 071c44a1-b4d4-4339-a727-03a79f725851-0000 (Aurora) at scheduler-6dcb9baa-503f-44a9-9df6-79da717f3a1c@192.168.33.7:8083
>     
>     In all the above cases the running task was not affected and was available in the
UI after the scheduler restarted.

Update on the last case (restarting the scheduler with the old framework name):

Rename *does* take effect. The master re-registered the framework to the same id and performed
a task reconciliation.

I0914 20:34:58.059640 28176 master.cpp:1297] Giving framework 071c44a1-b4d4-4339-a727-03a79f725851-0000
(Aurora) at scheduler-4a7c21b7-5d90-4218-936e-4142051b3444@192.168.33.7:8083 3weeks to failover
I0914 20:34:58.059675 28176 hierarchical.cpp:382] Deactivated framework 071c44a1-b4d4-4339-a727-03a79f725851-0000
I0914 20:35:23.447479 28175 master.cpp:2424] Received SUBSCRIBE call for framework 'TwitterScheduler'
at scheduler-cea31751-7cb5-46b2-8208-f9ab1d4fe86c@192.168.33.7:8083
I0914 20:35:23.447573 28175 master.cpp:2500] Subscribing framework TwitterScheduler with checkpointing
enabled and capabilities [ REVOCABLE_RESOURCES, GPU_RESOURCES ]
I0914 20:35:23.447592 28175 master.cpp:2564] Updating info for framework 071c44a1-b4d4-4339-a727-03a79f725851-0000
I0914 20:35:23.447615 28175 master.cpp:2577] Framework 071c44a1-b4d4-4339-a727-03a79f725851-0000
(TwitterScheduler) at scheduler-4a7c21b7-5d90-4218-936e-4142051b3444@192.168.33.7:8083 failed
over
I0914 20:35:23.447777 28175 hierarchical.cpp:348] Activated framework 071c44a1-b4d4-4339-a727-03a79f725851-0000
I0914 20:35:23.447968 28175 master.cpp:5709] Sending 1 offers to framework 071c44a1-b4d4-4339-a727-03a79f725851-0000
(TwitterScheduler) at scheduler-cea31751-7cb5-46b2-8208-f9ab1d4fe86c@192.168.33.7:8083
I0914 20:36:24.069891 28173 master.cpp:5447] Performing explicit task state reconciliation
for 1 tasks of framework 071c44a1-b4d4-4339-a727-03a79f725851-0000 (TwitterScheduler) at scheduler-cea31751-7cb5-46b2-8208-f9ab1d4fe86c@192.168.33.7:8083
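
For anyone who wants to repeat this check against their own master log, here is a rough Python sketch (not part of the patch). It pulls the framework id and the name shown in parentheses out of master log lines like the ones above and lists the names seen for each id. The regex, the function name and the SAMPLE_LOG variable are my own assumptions for illustration; the sample lines are copied from the log above, and the pattern may need adjusting for other Mesos versions.

```python
import re

# Assumed pattern: "framework <id> (<name>)" as it appears in the master log
# lines quoted in this thread. Not an official Mesos log format definition.
FRAMEWORK_RE = re.compile(
    r"[Ff]ramework\s+(?P<id>[0-9a-f-]+)\s+\((?P<name>[^)]+)\)"
)

# Sample lines copied from the log above, hard-wrapped as in the email.
SAMPLE_LOG = """
I0914 20:34:58.059640 28176 master.cpp:1297] Giving framework 071c44a1-b4d4-4339-a727-03a79f725851-0000
(Aurora) at scheduler-4a7c21b7-5d90-4218-936e-4142051b3444@192.168.33.7:8083 3weeks to failover
I0914 20:35:23.447615 28175 master.cpp:2577] Framework 071c44a1-b4d4-4339-a727-03a79f725851-0000
(TwitterScheduler) at scheduler-4a7c21b7-5d90-4218-936e-4142051b3444@192.168.33.7:8083 failed
over
I0914 20:35:23.447968 28175 master.cpp:5709] Sending 1 offers to framework 071c44a1-b4d4-4339-a727-03a79f725851-0000
(TwitterScheduler) at scheduler-cea31751-7cb5-46b2-8208-f9ab1d4fe86c@192.168.33.7:8083
"""


def framework_names_by_id(log_text):
    """Map each framework id to the ordered list of names logged for it."""
    # Long log lines are wrapped in the email, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    seen = {}
    for match in FRAMEWORK_RE.finditer(flat):
        names = seen.setdefault(match.group("id"), [])
        # Record a name only when it differs from the previous one for this id.
        if not names or names[-1] != match.group("name"):
            names.append(match.group("name"))
    return seen


if __name__ == "__main__":
    for framework_id, names in framework_names_by_id(SAMPLE_LOG).items():
        print(framework_id, "->", " -> ".join(names))
```

A single id with more than one name means the rename was applied without a new framework being registered. More than one id in the result would point to the bad case Zameer asked about.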


- Santhosh Kumar


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51874/#review148816
-----------------------------------------------------------


On Sept. 13, 2016, 5:18 p.m., Santhosh Kumar Shanmugham wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51874/
> -----------------------------------------------------------
> 
> (Updated Sept. 13, 2016, 5:18 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1688
>     https://issues.apache.org/jira/browse/AURORA-1688
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Change framework_name default value from 'TwitterScheduler' to 'aurora'
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md ad2c68a6defe07c94480d7dee5b1496b50dc34e5 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java
8a386bd208956eb0c8c2f48874b0c6fb3af58872 
>   src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh 97677f24a50963178a123b420d7ac136e4fde3fe

> 
> Diff: https://reviews.apache.org/r/51874/diff/
> 
> 
> Testing
> -------
> 
> ./build-support/jenkins/build.sh
> ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>

