openwhisk-dev mailing list archives

From Tyson Norris <tnor...@adobe.com.INVALID>
Subject Re: Action health checks
Date Thu, 07 Nov 2019 16:48:58 GMT
Hi - 
As discussed, I have updated the PR to reflect:
    > - for prewarm, use the tcp connection for monitoring outside of activation
    > workflow
    > - for warm, handle it as a case of retry: a request *connection*
    > failure, for /run only, is handled by rescheduling back to
    > ContainerPool (/init should already be handled by retrying for a time period).
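
To make the warm-container case concrete, here is a minimal sketch of "reschedule on connection failure" in Python. It is not the code from the PR: `run_with_reschedule`, `send_run`, and `NoHealthyContainer` are hypothetical names, and the candidate list stands in for whatever ContainerPool would offer next. Only a refused connection triggers a move to another container, since in that case the request never reached the action.

```python
from typing import Callable, Iterable, Optional


class NoHealthyContainer(Exception):
    """Raised when every candidate container refused the connection."""


def run_with_reschedule(
    containers: Iterable[str],
    send_run: Callable[[str], str],
) -> str:
    """Send /run to the first candidate container that accepts the connection.

    `send_run` is a hypothetical callable that performs the /run request
    against one container and raises ConnectionRefusedError when no
    connection could be established.
    """
    last_error: Optional[ConnectionRefusedError] = None
    for container in containers:
        try:
            return send_run(container)
        except ConnectionRefusedError as err:
            # The request never reached the action, so handing the
            # activation to a different container cannot run it twice.
            last_error = err
    # Any other exception propagates unchanged: once the request may have
    # reached the container, rescheduling is no longer known to be safe.
    raise NoHealthyContainer("no container accepted the connection") from last_error
```

Errors raised after the connection was established are deliberately not caught, which keeps the sketch within the "connection failure only" scope described above.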

Please review and provide any feedback.
https://github.com/apache/openwhisk/pull/4698

Thanks!
Tyson

On 10/30/19, 9:03 AM, "Markus Thömmes" <markusthoemmes@apache.org> wrote:

    Yes, I used the word "retry" here to mean "reschedule to another
    container", just like you would if the healthiness probe failed.
    
    A word of caution: TCP probes can behave strangely in a container
    setting. They sometimes accept connections even though nothing is
    listening.
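
As an illustration of this caveat, the following Python sketch contrasts a bare TCP probe with a probe that only succeeds when something actually answers. The function names, the default `/health` path, and the timeouts are assumptions for illustration; the thread only mentions a /health call as a possibility, and none of this is taken from the PR.

```python
import http.client
import socket


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection can be opened.

    This only proves that something accepted the connection, which may be
    a proxy or the kernel backlog rather than the action itself.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_probe(host: str, port: int, path: str = "/health",
               timeout: float = 1.0) -> bool:
    """Return True only if an HTTP 200 response comes back for `path`.

    The /health route is hypothetical; substitute whatever route the
    runtime really serves.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status == 200
    except (OSError, http.client.HTTPException):
        # Refused, timed out, or answered with something that is not HTTP.
        return False
    finally:
        conn.close()
```

A socket that accepts connections but never replies passes `tcp_probe` and fails `http_probe`, which is the kind of false positive described above. The cost is a full request/response round trip, so this fits monitoring outside the activation workflow better than the critical path.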
    
    Am Mi., 30. Okt. 2019 um 16:34 Uhr schrieb Tyson Norris
    <tnorris@adobe.com.invalid>:
    
    > I don't think "retry" is the right handling for warm connection failures -
    > if a connection cannot be made due to container crash/removal, it won't
    > suddenly come back. I would instead treat it as a "reschedule", where the
    > failure routes the activation back to ContainerPool, to be scheduled to a
    > different container. I'm not sure how distinct we can be on detecting
    > container failure vs. a temporary network issue that may or may not resolve
    > on its own, so I would treat them the same, and assume the container is
    > gone.
    >
    > So for this PR, is there any objection to:
    > - for prewarm, use the tcp connection for monitoring outside of activation
    > workflow
    > - for warm, handle it as a case of retry: a request *connection*
    > failure, for /run only, is handled by rescheduling back to
    > ContainerPool (/init should already be handled by retrying for a time period).
    >
    > Thanks!
    > Tyson
    >
    > On 10/30/19, 7:03 AM, "Markus Thömmes" <markusthoemmes@apache.org> wrote:
    >
    >     Increasing latency would be my biggest concern here as well. With a
    >     health ping, we can't even be sure that a container is still healthy
    >     for the "real request". To guarantee that, I'd still propose to have a
    >     look at the possible failure modes and implement a retry mechanism on
    >     them. If you get a "connection refused" error, I'm fairly certain that
    >     it can be retried without harm. In fact, any error where we can
    >     guarantee that we haven't actually reached the container can be safely
    >     retried in the described way.
    >
    >     Pre-warmed containers indeed are somewhat of a different story. A
    >     health ping as mentioned here would for sure help there, be it just a
    >     TCP probe or even a full-fledged /health call. I'd be fine with either
    >     way in this case as it doesn't affect the critical path.
    >
    >     Am Di., 29. Okt. 2019 um 18:00 Uhr schrieb Tyson Norris
    >     <tnorris@adobe.com.invalid>:
    >
    >     > By "critical path" you mean the path during action invocation?
    >     > The current PR only introduces latency on that path for the case of a
    >     > Paused container changing to Running state (once per transition from
    >     > Paused -> Running).
    >     > In case it isn't clear, this change does not affect any retry (or
    >     > lack of retry) behavior.
    >     >
    >     > Thanks
    >     > Tyson
    >     >
    >     > On 10/29/19, 9:38 AM, "Rodric Rabbah" <rodric@gmail.com> wrote:
    >     >
    >     >     as a longer term point to consider, i think the current model of
    >     >     "best effort at most once" was the wrong design point - if we
    >     >     embraced failure and just retried (at least once), then failure at
    >     >     this level would lead to retries which is reasonable.
    >     >
    >     >     if we added a third health route or introduced a health check,
    >     >     would we increase the critical path?
    >     >
    >     >     -r
    >     >
    >     >     On Tue, Oct 29, 2019 at 12:29 PM David P Grove <groved@us.ibm.com>
    >     >     wrote:
    >     >
    >     >     > Tyson Norris <tnorris@adobe.com.INVALID> wrote on 10/28/2019
    >     >     > 11:17:50 AM:
    >     >     > > I'm curious to know what other
    >     >     > > folks think about "generic active probing from invoker" vs
    >     >     > > "docker/mesos/k8s specific integrations for reacting to
    >     >     > > container failures"?
    >     >     > >
    >     >     >
    >     >     > From a pure maintenance and testing perspective I think a
    >     >     > single common mechanism would be best if we can do it with
    >     >     > acceptable runtime overhead.
    >     >     >
    >     >     > --dave
    >     >     >
    >     >
    >     >
    >     >
    >
    >
    >
    
