aurora-issues mailing list archives

From "Maxim Khutornenko (JIRA)" <>
Subject [jira] [Commented] (AURORA-1615) Preemptor crashes scheduler during host maintenance
Date Thu, 11 Feb 2016 22:23:18 GMT


Maxim Khutornenko commented on AURORA-1615:

The maintenance mode is currently just a hint to the scheduler to avoid using hosts scheduled
for maintenance:

The above preference order makes sure that even if someone sets the entire cluster into maintenance
mode, tasks will still schedule. I think we should approach this similarly in the preemptor and
still allow hosts scheduled for maintenance to participate in preemption rounds.

> Preemptor crashes scheduler during host maintenance
> ---------------------------------------------------
>                 Key: AURORA-1615
>                 URL:
>             Project: Aurora
>          Issue Type: Bug
>          Components: Scheduler
>            Reporter: Maxim Khutornenko
>            Assignee: Maxim Khutornenko
> We have noticed an occasional scheduler failover when host maintenance is in effect:
> {noformat}
> To index multiple values under a key, use Multimaps.index.
>         at
>         at
>         at org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.lambda$run$224(
>         at
>         at org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(
>         at org.apache.aurora.common.inject.TimedInterceptor.invoke(
>         at
>         at
>         at
> {noformat}
> Diffing the colliding HostOffer objects revealed that the only difference is the HostAttributes
maintenance mode value:
> mode=NONE vs. mode=DRAINING
> Upon examination, it appears quite possible to have duplicate HostOffer instances
(same offer, same slave, different maintenance mode) due to the way [offers are accessed|]
as an unmodifiable view over the underlying ConcurrentSkipListSet. Here is a possible sequence:
> # Pending task processor starts [building unique index|]
and the offers iterator pulls OfferA with mode=NONE
> # A host drain operation is initiated and a HostAttributesChanged event is raised
> # OfferManager [processes|]
the HostAttributesChanged event and atomically [swaps|]
OfferA with OfferA' (mode=DRAINING)
> # The iterator inside the uniqueIndex routine pulls OfferA' and the error is raised.
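
The sequence above can be replayed on a single thread. The following is a minimal sketch, not the Aurora code: `Offer`, `BY_MODE_THEN_ID`, `uniqueIndex` and `replayRace` are hypothetical stand-ins, the ordering (hosts with mode=NONE first) is an assumption, and `uniqueIndex` only mimics Guava's `Maps.uniqueIndex` by rejecting duplicate keys. The skip list iterator is weakly consistent, so seeing OfferA' after the swap is possible but not guaranteed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentSkipListSet;

final class OfferRaceSketch {
  enum Mode { NONE, DRAINING }

  // Hypothetical stand-in for HostOffer: same offer id, different maintenance mode.
  static final class Offer {
    final String offerId;
    final Mode mode;
    Offer(String offerId, Mode mode) { this.offerId = offerId; this.mode = mode; }
  }

  // Assumed ordering: offers on hosts outside maintenance sort first.
  static final Comparator<Offer> BY_MODE_THEN_ID =
      Comparator.comparing((Offer o) -> o.mode).thenComparing((Offer o) -> o.offerId);

  // Mimics Maps.uniqueIndex: a second value under the same key is an error.
  static Map<String, Offer> uniqueIndex(Iterable<Offer> offers) {
    Map<String, Offer> index = new HashMap<>();
    for (Offer o : offers) {
      if (index.put(o.offerId, o) != null) {
        throw new IllegalArgumentException("Multiple entries with same key: " + o.offerId);
      }
    }
    return index;
  }

  // Replays the four steps in order and returns what the iterator handed out.
  static List<Offer> replayRace() {
    ConcurrentSkipListSet<Offer> offers = new ConcurrentSkipListSet<>(BY_MODE_THEN_ID);
    Offer offerA = new Offer("offer-a", Mode.NONE);
    offers.add(offerA);
    offers.add(new Offer("offer-b", Mode.NONE));
    Set<Offer> view = Collections.unmodifiableSet(offers); // live view, not a copy

    List<Offer> seen = new ArrayList<>();
    Iterator<Offer> it = view.iterator();
    seen.add(it.next());                              // step 1: iterator pulls OfferA
    offers.remove(offerA);                            // steps 2-3: OfferA is swapped ...
    offers.add(new Offer("offer-a", Mode.DRAINING));  // ... for OfferA' (sorts last)
    while (it.hasNext()) {
      seen.add(it.next());                            // step 4: may also yield OfferA'
    }
    return seen;
  }

  public static void main(String[] args) {
    List<Offer> seen = replayRace();
    for (Offer o : seen) {
      System.out.println(o.offerId + " mode=" + o.mode);
    }
    try {
      uniqueIndex(seen);
      System.out.println("no duplicate observed on this run");
    } catch (IllegalArgumentException e) {
      System.out.println("index failed: " + e.getMessage());
    }
  }
}
```

Because the view is backed by the live set, nothing stops the iterator from handing out both versions of the same offer within one pass, which is the condition the unique index rejects.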
> We should either copy inside a synchronized getOffers() implementation or deal with possible
duplicates at the call site. I tend to think copying on access is the better approach. The only
consumer of getOffers() is PendingTaskProcessor, with a relatively infrequent run loop (1
minute), so the performance impact of making a copy of all offers within a synchronized method should
be acceptable. The alternative implies leaking the abstraction of host maintenance mode into
the preemptor, which is less than ideal.
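
The copy-on-access option could look roughly like the sketch below. It is an illustration under assumptions, not the actual OfferManager: `OfferStore`, `Offer`, `swap` and the comparator are hypothetical names, and the sketch assumes the swap runs under the same monitor as `getOffers()`, which is what makes the copy a consistent snapshot.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListSet;

final class SnapshotOffersSketch {
  enum Mode { NONE, DRAINING }

  // Hypothetical stand-in for HostOffer.
  static final class Offer {
    final String offerId;
    final Mode mode;
    Offer(String offerId, Mode mode) { this.offerId = offerId; this.mode = mode; }
  }

  // Hypothetical stand-in for the offer-holding part of OfferManager.
  static final class OfferStore {
    private final ConcurrentSkipListSet<Offer> offers = new ConcurrentSkipListSet<>(
        Comparator.comparing((Offer o) -> o.mode).thenComparing((Offer o) -> o.offerId));

    synchronized void add(Offer offer) {
      offers.add(offer);
    }

    // Replaces an offer with its updated version while holding the lock.
    synchronized void swap(Offer oldOffer, Offer newOffer) {
      offers.remove(oldOffer);
      offers.add(newOffer);
    }

    // Copy on access: the caller iterates a private snapshot, so a later
    // swap cannot add a second version of an offer to the same pass.
    synchronized List<Offer> getOffers() {
      return Collections.unmodifiableList(new ArrayList<>(offers));
    }
  }

  public static void main(String[] args) {
    OfferStore store = new OfferStore();
    Offer offerA = new Offer("offer-a", Mode.NONE);
    store.add(offerA);
    store.add(new Offer("offer-b", Mode.NONE));

    List<Offer> snapshot = store.getOffers();
    store.swap(offerA, new Offer("offer-a", Mode.DRAINING)); // happens mid-run
    for (Offer o : snapshot) {
      System.out.println(o.offerId + " mode=" + o.mode);     // still the old, consistent view
    }
  }
}
```

The trade-off is one full copy per call, which matches the reasoning above that a single infrequent consumer makes the cost tolerable, and it keeps maintenance mode out of the preemptor's call site.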

This message was sent by Atlassian JIRA