mesos-issues mailing list archives

From "Alex Clemmer (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MESOS-6803) Agent authentication does not have an initial `delay`
Date Thu, 15 Dec 2016 19:29:58 GMT

     [ https://issues.apache.org/jira/browse/MESOS-6803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Clemmer updated MESOS-6803:
--------------------------------
    Priority: Blocker  (was: Critical)

> Agent authentication does not have an initial `delay`
> -----------------------------------------------------
>
>                 Key: MESOS-6803
>                 URL: https://issues.apache.org/jira/browse/MESOS-6803
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, scheduler driver
>            Reporter: Alex Clemmer
>            Assignee: Alex Clemmer
>            Priority: Blocker
>              Labels: microsoft, security, windows-mvp
>
> When an agent registers, there is currently a somewhat subtle difference in behavior
between the cases when it does and does not authenticate:
> * In the case that it DOES NOT authenticate, we will choose a random time between 0 and
the agent `registration_backoff_factor` to initiate registration. The reason for this is to
avoid every agent hitting the master at once during master failover. (We also employ backoff
to help this.) See: [1]
> * In the case that it DOES authenticate, we always attempt to authenticate and register
the Agent immediately. So currently, in authenticated clusters, all agents will try to
register with the new master immediately after failover; this is mitigated somewhat by the
fact that the authenticated codepath still uses backoff. See: [2]
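As a rough illustration of the unauthenticated behavior described above, the sketch below picks a uniformly random initial delay between 0 and a backoff factor. This is not the Mesos implementation: `initialDelay` is a hypothetical helper, and the use of `std::chrono` and `std::mt19937` is an assumption made to keep the example self-contained.

```cpp
#include <cassert>
#include <chrono>
#include <random>

// Hypothetical helper (not Mesos code): choose a random delay in
// [0, backoffFactor) so that agents do not all contact the master at the
// same instant after a failover.
std::chrono::milliseconds initialDelay(
    std::chrono::milliseconds backoffFactor,
    std::mt19937& rng)
{
  assert(backoffFactor.count() >= 0);

  // A zero factor leaves no room for jitter, so start immediately.
  if (backoffFactor.count() == 0) {
    return std::chrono::milliseconds(0);
  }

  std::uniform_int_distribution<long long> dist(0, backoffFactor.count() - 1);
  return std::chrono::milliseconds(dist(rng));
}
```

Under these assumptions, applying the same helper before the first authentication attempt would give both codepaths the same initial jitter.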
> It is important to resolve this disparity, not only to make the system more resilient,
but also because it directly blocks us from passing many tests on platforms where authentication
is not supported at all (Windows in particular).
> For some time, we have meant to make both the authenticated and unauthenticated codepaths
use a random `delay` to begin. See Adam's TODO in [3]. Historically, people seem to have had
a few problems with this:
> 1. Deep in the bowels of git history, Vinod notes[4] that the Agent might end up trying
to authenticate twice, if a new master is detected before the auth is processed. It seems
to me that this should not be an issue (or at least, not any more).
> 2. Many of our tests depend on authenticated registration happening even if `Clock::pause()`
has been called. Because our first attempt at authentication and Agent registration is
dispatched for immediate execution, these events still happen even when the clock is paused.
If we use a `delay`, they are instead scheduled to happen in the future, and any test that
calls `Clock::pause()` during this time will fail.
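The difference in point 2 between an immediately dispatched event and a delayed one can be sketched with a toy clock whose time only moves when the test advances it. `ToyClock` and its methods are hypothetical stand-ins invented for this example; they are not the libprocess `Clock` API.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <utility>

// Toy model of a paused test clock (NOT libprocess): time stands still
// unless `advance` is called explicitly.
class ToyClock
{
public:
  // Runs `f` right away, like an event dispatched for immediate execution.
  void dispatch(const std::function<void()>& f) { f(); }

  // Schedules `f` to run `ms` milliseconds after the current toy time.
  void delay(long ms, std::function<void()> f)
  {
    assert(ms >= 0);
    timers.emplace(now + ms, std::move(f));
  }

  // Moves toy time forward and fires every timer that has come due.
  void advance(long ms)
  {
    now += ms;
    while (!timers.empty() && timers.begin()->first <= now) {
      std::function<void()> f = std::move(timers.begin()->second);
      timers.erase(timers.begin());
      f();
    }
  }

private:
  long now = 0;
  std::multimap<long, std::function<void()>> timers;
};
```

In this model, a callback passed to `dispatch` runs even though time never moves, while a callback passed to `delay` stays pending until the test advances the clock past its deadline. A test that pauses the clock and never advances it would therefore never observe the delayed registration.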
> The resolution of this bug, at minimum, involves fixing the semantics of the above tests
to pass when `HAS_AUTHENTICATION` is set to false. Following this, it is realistic to expect
that we add `delay` to the authentication codepath as well.
> In terms of resolution, it is useful to know the specific tests that will fail if `HAS_AUTHENTICATION`
is set to false:
> ```
> [  FAILED  ] ExamplesTest.V1JavaFramework
> [  FAILED  ] ExamplesTest.PythonFramework
> [  FAILED  ] FaultToleranceTest.FrameworkReregister
> [  FAILED  ] MasterAllocatorTest/0.RebalancedForUpdatedWeights, where TypeParam = mesos::internal::master::allocator::MesosAllocator<mesos::internal::master::allocator::HierarchicalAllocatorProcess<mesos::internal::master::allocator::DRFSorter, mesos::internal::master::allocator::DRFSorter, mesos::internal::master::allocator::DRFSorter> >
> [  FAILED  ] MasterAllocatorTest/1.RebalancedForUpdatedWeights, where TypeParam = mesos::internal::tests::Module<mesos::allocator::Allocator, (mesos::internal::tests::ModuleID)6>
> [  FAILED  ] MasterTest.EndpointsForHalfRemovedSlave
> [  FAILED  ] MasterTest.UnreachableTaskAfterFailover
> [  FAILED  ] MasterTest.CancelRecoveredSlaveRemoval
> [  FAILED  ] MasterTest.RecoveredFramework
> [  FAILED  ] OversubscriptionTest.RescindRevocableOfferWithIncreasedRevocable
> [  FAILED  ] OversubscriptionTest.RescindRevocableOfferWithDecreasedRevocable
> [  FAILED  ] OversubscriptionTest.Reregistration
> [  FAILED  ] PartitionTest.ReregisterSlavePartitionAware
> [  FAILED  ] PartitionTest.ReregisterSlaveNotPartitionAware
> [  FAILED  ] PartitionTest.PartitionedSlaveReregistrationMasterFailover
> [  FAILED  ] PartitionTest.PartitionedSlaveOrphanedTask
> [  FAILED  ] PartitionTest.SpuriousSlaveReregistration
> [  FAILED  ] PartitionTest.PartitionedSlaveStatusUpdates
> [  FAILED  ] PartitionTest.RegistryGcByCount
> [  FAILED  ] PartitionTest.RegistryGcByAge
> [  FAILED  ] PartitionTest.RegistryGcRace
> [  FAILED  ] OneWayPartitionTest.MasterToSlave
> [  FAILED  ] ReconciliationTest.ReconcileStatusUpdateTaskState
> [  FAILED  ] ReservationTest.ACLMultipleOperations
> [  FAILED  ] ReservationTest.WithoutAuthenticationWithoutPrincipal
> [  FAILED  ] ReservationTest.WithoutAuthenticationWithPrincipal
> [  FAILED  ] SlaveTest.DuplicateTerminalUpdateBeforeAck
> [  FAILED  ] SlaveTest.StateEndpoint
> [  FAILED  ] SlaveTest.PingTimeoutNoPings
> [  FAILED  ] SlaveTest.PingTimeoutSomePings
> [  FAILED  ] SlaveTest.ReregisterWithStatusUpdateTaskState
> [  FAILED  ] SlaveTest.MaxCompletedExecutorsPerFrameworkFlag
> [  FAILED  ] ContentType/AgentAPITest.NestedContainerLaunchFalse/0, where GetParam() = application/x-protobuf
> [  FAILED  ] ContentType/AgentAPITest.NestedContainerLaunchFalse/1, where GetParam() = application/json
> [  FAILED  ] ContentType/AgentAPITest.NestedContainerLaunch/0, where GetParam() = application/x-protobuf
> [  FAILED  ] ContentType/AgentAPITest.NestedContainerLaunch/1, where GetParam() = application/json
> [  FAILED  ] ContentType/AgentAPITest.LaunchNestedContainerSessionAttachFailure/0, where GetParam() = application/x-protobuf
> [  FAILED  ] ContentType/AgentAPITest.LaunchNestedContainerSessionAttachFailure/1, where GetParam() = application/json
> [  FAILED  ] DiskResource/PersistentVolumeTest.MasterFailover/0, where GetParam() = 0
> [  FAILED  ] DiskResource/PersistentVolumeTest.AccessPersistentVolume/0, where GetParam() = 0
> [  FAILED  ] DiskResource/PersistentVolumeTest.AccessPersistentVolume/1, where GetParam() = 1
> [  FAILED  ] DiskResource/PersistentVolumeTest.SharedPersistentVolumeRescindOnDestroy/0, where GetParam() = 0
> [  FAILED  ] DiskResource/PersistentVolumeTest.SharedPersistentVolumeRescindOnDestroy/1, where GetParam() = 1
> [  FAILED  ] MountDiskResource/PersistentVolumeTest.AccessPersistentVolume/0, where GetParam() = 2
> [  FAILED  ] MountDiskResource/PersistentVolumeTest.SharedPersistentVolumeRescindOnDestroy/0, where GetParam() = 2
> ```
> [1] https://github.com/apache/mesos/blob/c5c5c13deab834e6db7e1f9d687b8cc0f6a0641f/src/slave/slave.cpp#L948
> [2] https://github.com/apache/mesos/blob/c5c5c13deab834e6db7e1f9d687b8cc0f6a0641f/src/slave/slave.cpp#L942
> [3] https://github.com/apache/mesos/blob/c5c5c13deab834e6db7e1f9d687b8cc0f6a0641f/src/slave/slave.cpp#L938
> [4] https://github.com/apache/mesos/commit/09b1dc3e95955aa187458fcb61e1d66b04ec3af2#diff-01648193f4029dc9fc1e024949f6ea28R562



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
