aurora-issues mailing list archives

From "David McLaughlin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AURORA-1890) Job Update Pulse History is not durably stored
Date Mon, 13 Feb 2017 22:17:41 GMT

    [ https://issues.apache.org/jira/browse/AURORA-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15864569#comment-15864569 ]

David McLaughlin commented on AURORA-1890:
------------------------------------------

You're right, the write volume is totally dependent on your update volume and the pulse interval.
For many use cases, the cost of the update would be negligible. I think the real concern was
the cost of reading the last pulse time. 

One other reason why persisting the pulse is not super useful is that the scheduler failover time
typically exceeds a sane pulse timeout. The same applies to automatically setting it to the
last event time (which would be preferable IMO). I think the reason we backed out of the grace
period change (which would have set the pulse timestamp to the time the scheduler acquired
leadership) is that it would potentially reactivate a bunch of updates that were
legitimately blocked. In the end, we agreed the churn from ROLLING_FORWARD -> BLOCKED_AWAITING_PULSE
-> ROLLING_FORWARD was harmless. But I suppose if you have automation on top of this that
reacts to state changes, it could be annoying.
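
To make the trade-off above concrete, here is a minimal Python sketch (not Aurora code) that compares the three options discussed: resetting pulse state in memory, restoring a persisted pulse timestamp, and the grace period approach of using the leadership timestamp. The function names, the strategy labels, and the rule that an update with no known pulse counts as blocked are all assumptions made for illustration.

```python
from typing import Optional

HOUR_MS = 3_600_000
MINUTE_MS = 60_000


def is_blocked(last_pulse_ms: Optional[int], now_ms: int, timeout_ms: int) -> bool:
    """Assumed rule: blocked if no pulse is known or the last pulse is too old."""
    if last_pulse_ms is None:
        return True
    return now_ms - last_pulse_ms >= timeout_ms


def blocked_after_failover(
    strategy: str, last_pulse_ms: int, leader_elected_ms: int, timeout_ms: int
) -> bool:
    """Return True if the update is blocked right after the new leader takes over.

    The strategy labels are hypothetical names for the options in the discussion.
    """
    if strategy == "in_memory":
        restored = None  # pulse state is lost on restart
    elif strategy == "persisted":
        restored = last_pulse_ms  # last pulse is read back from storage
    elif strategy == "grace_period":
        restored = leader_elected_ms  # pretend a pulse arrived at election time
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return is_blocked(restored, leader_elected_ms, timeout_ms)
```

With a pulse at time 0 and an assumed 90 second failover, the persisted timestamp keeps a 1h-timeout update running but still leaves a 1 minute-timeout update blocked, which is the "failover exceeds a sane pulse timeout" point. If the last pulse was two hours before the election, the grace period variant reports the update as unblocked even though it was legitimately blocked.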

> Job Update Pulse History is not durably stored
> ----------------------------------------------
>
>                 Key: AURORA-1890
>                 URL: https://issues.apache.org/jira/browse/AURORA-1890
>             Project: Aurora
>          Issue Type: Bug
>            Reporter: Zameer Manji
>
> I have experienced the following problem with pulse updates. To reproduce:
> 1. Create an update with a pulse timeout of 1h
> 2. Send a pulse to get the update going.
> 3. Fail over the scheduler immediately after.
> 4. Observe that the update is awaiting another pulse right after the failover.
> This is because the {{JobUpdateControllerImpl}} stores pulse history and state in memory
> in {{PulseHandler}}. On scheduler startup, the pulse state is reset to no pulse received.
> We can solve this by durably storing the timestamp of the last pulse received in storage.
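
The proposed fix in the description can be sketched roughly as follows. This is a standalone Python illustration, not the Aurora implementation: `FilePulseStore`, `PulseTracker`, the JSON file layout, and `demo_restart` are hypothetical names, and a real scheduler would write to its replicated storage rather than a local file.

```python
import json
import os
import tempfile
from typing import Dict, Optional


class FilePulseStore:
    """Hypothetical durable store mapping update id -> last pulse time (ms)."""

    def __init__(self, path: str) -> None:
        self._path = path

    def load(self) -> Dict[str, int]:
        if not os.path.exists(self._path):
            return {}
        with open(self._path, "r", encoding="utf-8") as f:
            return json.load(f)

    def save(self, update_id: str, timestamp_ms: int) -> None:
        pulses = self.load()
        pulses[update_id] = timestamp_ms
        # Write to a temporary file, then rename it over the old one,
        # so a crash mid-write does not leave a truncated file behind.
        tmp_path = self._path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(pulses, f)
        os.replace(tmp_path, self._path)


class PulseTracker:
    """Tracks the last pulse per update, optionally backed by a durable store."""

    def __init__(self, store: Optional[FilePulseStore] = None) -> None:
        self._store = store
        # On startup, recover persisted pulses if a store is configured;
        # without one, every update starts with no pulse received.
        self._last_pulse: Dict[str, int] = store.load() if store else {}

    def pulse(self, update_id: str, now_ms: int) -> None:
        self._last_pulse[update_id] = now_ms
        if self._store is not None:
            self._store.save(update_id, now_ms)

    def is_blocked(self, update_id: str, now_ms: int, timeout_ms: int) -> bool:
        last = self._last_pulse.get(update_id)
        return last is None or now_ms - last >= timeout_ms


def demo_restart(timeout_ms: int = 3_600_000) -> bool:
    """Pulse once, simulate a restart, and report whether the update is blocked."""
    path = os.path.join(tempfile.mkdtemp(), "pulses.json")
    PulseTracker(FilePulseStore(path)).pulse("update-1", 1_000)
    restarted = PulseTracker(FilePulseStore(path))  # fresh process, same storage
    return restarted.is_blocked("update-1", 2_000, timeout_ms)
```

Note that each pulse here costs one storage write and each startup one read, which is the write and read cost weighed in the comment above.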



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
