openwhisk-dev mailing list archives

From Tyson Norris <tnor...@adobe.com.INVALID>
Subject Action health checks
Date Sun, 27 Oct 2019 00:47:54 GMT
Hi Whiskers –
We periodically hit an unfortunate problem where a docker container (or worse, many of them)
dies unexpectedly, outside of any HTTP usage from the invoker. In these cases, the Invoker may
still hold references to prewarm or warm containers, and when an activation eventually arrives
that matches one of those references, the HTTP workflow starts and fails immediately since
the container is no longer listening, resulting in failed activations. An even worse situation
is when a container failed earlier and a new container, initialized with a different
action, is started on the same host and port (more likely a problem for k8s/mesos cluster
usage).

To mitigate these issues, I put together a health check process [1] from the invoker to action
containers, where we test

  *   prewarm containers periodically, to verify they are still operational, and
  *   warm containers immediately after resuming them (before HTTP requests are sent).

In case of prewarm failure, we backfill the prewarms to the configured count.
In case of warm failure, the activation is rescheduled to ContainerPool, which typically would
either route it to a different prewarm or start a new cold container.
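
To make the two failure paths concrete, here is a rough sketch in Python (not the Scala code
in the PR). All names here (backfill_prewarms, resume_warm, is_healthy, start_prewarm,
reschedule) are hypothetical and only illustrate the policy described above: unhealthy
prewarms are dropped and replaced up to the configured count, and an unhealthy warm container
causes the activation to be handed back for rescheduling.

```python
from typing import Callable, List

# Hypothetical: a container is identified here by a plain string such as "host:port".
Container = str


def backfill_prewarms(
    prewarms: List[Container],
    is_healthy: Callable[[Container], bool],
    start_prewarm: Callable[[], Container],
    target_count: int,
) -> List[Container]:
    """Drop prewarms that fail the health check, then start new ones
    until the pool is back at target_count."""
    alive = [c for c in prewarms if is_healthy(c)]
    while len(alive) < target_count:
        alive.append(start_prewarm())
    return alive


def resume_warm(
    container: Container,
    is_healthy: Callable[[Container], bool],
    reschedule: Callable[[], None],
) -> bool:
    """Check a warm container right after resuming it, before any HTTP
    request is sent. On failure, hand the activation back for rescheduling."""
    if is_healthy(container):
        return True
    reschedule()
    return False
```

The health check is passed in as a callable so the policy stays separate from how the check
itself is performed.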

The test ping is a TCP connection only, since anything more would require updating the
HTTP protocol implemented by all runtimes. This test is good enough for the worst case of
“container has gone missing”, but cannot detect more subtle problems like “/run endpoint
is broken”. We could add other checks in the future to increase the quality of the test,
but I think most of them require expanding the HTTP protocol and the state managed at the
container, and I wanted to get something working for basic functionality to start with.
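
For illustration, a TCP-only ping of this kind can be sketched in a few lines of Python. This
is not the implementation in the PR; the function name tcp_ping and the default timeout are
assumptions. It only answers "is something accepting connections on this host and port", so
it says nothing about whether /run works.

```python
import socket


def tcp_ping(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened
    within the timeout, False otherwise."""
    try:
        # The connection is closed again immediately; no HTTP request is sent.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False
```

A caller would treat a False result as "container has gone missing" and take the prewarm or
warm failure path accordingly.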

Let me know if you have opinions about this, and we can discuss here or in the PR.
Thanks
Tyson

[1] https://github.com/apache/openwhisk/pull/4698