openwhisk-dev mailing list archives

From "Sven Lange-Last" <sven.lange-l...@de.ibm.com>
Subject Re: OpenWhisk invoker overloads - "Rescheduling Run message"
Date Fri, 19 Jul 2019 14:48:54 GMT
Hello Tyson,

regarding your feedback:

> Related to the "Rescheduling Run message", one problem we have 
> encountered in these cases is that the invoker becomes unstable due 
> (I think) to a tight message loop, since the message that couldn't 
> run is immediately resent to the pool to be run, which fails again, 
> etc. We saw CPU getting pegged, and invoker eventually would crash.
> I have a PR related to cluster managed resources where, among other 
> things, this message looping is removed:
> https://github.com/apache/incubator-openwhisk/pull/4326/files#diff-726b36b3ab8c7cff0b93dead84311839L198
> 
> Instead of resending the message to the pool immediately, it just 
> waits in the runbuffer, and the runbuffer is processed in reaction 
> to any potential change in resources: NeedWork, ContainerRemoved, 
> etc. This may add delay to any buffered message(s), but seems to 
> avoid the catastrophic crash in our systems. 
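
The buffering behaviour described in the quote can be illustrated with a minimal sketch. It is written in Python, not taken from the invoker or from the referenced PR, and every name in it (PoolSketch, on_run, on_resource_event, the attempts counter) is hypothetical. It only shows the idea: a Run message that does not fit waits in the run buffer and is retried when a resource event arrives, instead of being resent to the pool immediately.

```python
from collections import deque


class PoolSketch:
    """Toy container pool (hypothetical, not OpenWhisk code).

    Memory is tracked in MB. Run messages that do not fit wait in a
    buffer until a resource event signals that capacity may have changed.
    """

    def __init__(self, capacity_mb):
        self.free_mb = capacity_mb
        self.run_buffer = deque()  # Run messages that could not be started yet
        self.attempts = 0          # scheduling attempts, to show there is no busy loop

    def _try_start(self, memory_mb):
        self.attempts += 1
        if memory_mb <= self.free_mb:
            self.free_mb -= memory_mb
            return True
        return False

    def on_run(self, memory_mb):
        # Keep FIFO order: if something is already buffered, queue behind it
        # without trying to schedule at all.
        if self.run_buffer or not self._try_start(memory_mb):
            self.run_buffer.append(memory_mb)

    def on_resource_event(self, freed_mb=0):
        # Stand-in for events such as NeedWork or ContainerRemoved.
        # Only now is the buffer re-examined, head first.
        self.free_mb += freed_mb
        while self.run_buffer and self._try_start(self.run_buffer[0]):
            self.run_buffer.popleft()


pool = PoolSketch(capacity_mb=256)
pool.on_run(256)             # fits, starts immediately
pool.on_run(128)             # does not fit, buffered
pool.on_run(128)             # buffered behind the previous message
pool.on_resource_event(256)  # a container was removed, buffer drains
```

In this sketch a buffered message costs one scheduling attempt per resource event, so a full pool produces no retry traffic while nothing changes. The price is the delay mentioned in the quote: a buffered message waits until the next event.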

From my point of view, your proposal to change how rescheduled Run 
messages are processed makes sense. However, the PR you referenced above 
does not only improve this particular area but also includes a lot of 
other changes - in particular, it adds a different way of managing 
containers. Due to the PR's size and complexity, it's very hard to 
understand and review. Would you be able to split this PR up into 
smaller changes?


Mit freundlichen Grüßen / Regards,

Sven Lange-Last
Senior Software Engineer
IBM Cloud Functions
Apache OpenWhisk


E-mail: sven.lange-last@de.ibm.com
Schoenaicher Str. 220
Boeblingen, 71032
Germany




IBM Deutschland Research & Development GmbH
Chairwoman of the Supervisory Board: Martina Koederitz
Management: Dirk Wittkopp
Registered office: Böblingen / Registration court: Amtsgericht Stuttgart, 
HRB 243294


Tyson Norris <tnorris@adobe.com.INVALID> wrote on 2019/07/08 17:52:14:

> From: Tyson Norris <tnorris@adobe.com.INVALID>
> To: "dev@openwhisk.apache.org" <dev@openwhisk.apache.org>
> Date: 2019/07/08 18:01
> Subject: [EXTERNAL] Re:  Re: OpenWhisk invoker overloads - 
> "Rescheduling Run message"
> 
> Related to the "Rescheduling Run message", one problem we have 
> encountered in these cases is that the invoker becomes unstable due 
> (I think) to a tight message loop, since the message that couldn't 
> run is immediately resent to the pool to be run, which fails again, 
> etc. We saw CPU getting pegged, and invoker eventually would crash.
> I have a PR related to cluster managed resources where, among other 
> things, this message looping is removed:
> https://github.com/apache/incubator-openwhisk/pull/4326/files#diff-726b36b3ab8c7cff0b93dead84311839L198
> 
> Instead of resending the message to the pool immediately, it just 
> waits in the runbuffer, and the runbuffer is processed in reaction 
> to any potential change in resources: NeedWork, ContainerRemoved, 
> etc. This may add delay to any buffered message(s), but seems to 
> avoid the catastrophic crash in our systems. 
> 
> Thanks
> Tyson
> 
> On 7/5/19, 1:16 AM, "Sven Lange-Last" <sven.lange-last@de.ibm.com> wrote:
> 
>     Hello Dominic,
> 
>     thanks for your detailed response.
> 
>     I guess your understanding is right - just this small correction:
> 
>     > So the main issue here is there are too many "Rescheduling Run"
>     > messages in invokers?
> 
>     Seeing these log entries in the invoker is not the main issue. They
>     only indicate that something is going wrong in the invoker: more
>     activations are waiting to be processed than the ContainerPool can
>     currently serve.
> 
>     Actually, there are different reasons why "Rescheduling Run message"
>     log entries can show up in the invoker:
> 
>     1. Controllers send too many activations to an invoker.
> 
>     2. In the invoker, the container pool sends a Run message to a
>     container proxy but the container proxy fails to process it properly
>     and hands it back to the container pool. Examples: a Run message
>     arrives while the proxy is already removing the container; if
>     concurrency>1, the proxy buffers Run messages and returns them in
>     failure situations.
> 
>     Although I'm not 100% sure, I see more indications for reason 1 in
>     our logs than for reason 2.
> 
>     Regarding hypothesis "#controllers * getInvokerSlot(invoker user
>     memory size) > invoker user memory size": I can rule out this
>     hypothesis in our environments. We have "#controllers *
>     getInvokerSlot(invoker user memory size) = invoker user memory size".
>     I provided PR [1] to be sure about that.
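
The hypothesis above is plain arithmetic and can be checked with a small sketch. It assumes, and this is an assumption rather than something stated in this thread, that each controller's slot is the invoker's user memory divided by the number of controllers, with a lower bound at some minimum action memory. The function names and the 128 MB floor are hypothetical placeholders, not OpenWhisk's getInvokerSlot().

```python
def invoker_slot_mb(invoker_memory_mb, controllers, min_action_memory_mb=128):
    """Per-controller share of one invoker's user memory, in MB.

    Assumption: an even split with a floor, so that a slot can always
    hold at least one action of minimum size. The floor value is a
    placeholder.
    """
    return max(invoker_memory_mb // controllers, min_action_memory_mb)


def oversubscribed(invoker_memory_mb, controllers, min_action_memory_mb=128):
    """True if all controller slots together exceed the invoker's memory."""
    slot = invoker_slot_mb(invoker_memory_mb, controllers, min_action_memory_mb)
    return controllers * slot > invoker_memory_mb


# Even split: 4 controllers * 512 MB = 2048 MB, exactly the invoker's memory.
print(oversubscribed(2048, 4))
# Floor applies: 4 controllers * 128 MB = 512 MB on a 256 MB invoker.
print(oversubscribed(256, 4))
```

Under these assumptions the product can only exceed the invoker's memory when the floor applies, that is, when the invoker is small relative to the number of controllers.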
> 
>     Regarding hypothesis "invoker simply pulls too many Run messages from
>     MessageFeed": I think the part you described is perfectly right. The
>     question remains why controllers send too many Run messages, or a Run
>     message with an activation that is larger than the free memory
>     capacity currently available in the pool.
> 
>     The load balancer keeps memory book-keeping for all of its invoker
>     shards (memory size determined by getInvokerSlot()), so the load
>     balancer is supposed to only schedule an activation to an invoker if
>     the required memory does not exceed the controller's shard of the
>     invoker. Even if the resulting Run messages arrive on the invoker in
>     a changed order, the invoker's shard free memory should be
>     sufficient.
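
The book-keeping argument above can be modelled as follows. This is a toy model in Python under the assumption that each controller tracks only its own shard of one invoker. ShardBook and its methods are hypothetical names and do not mirror the load balancer's actual data structures.

```python
class ShardBook:
    """One controller's view of its memory shard on one invoker (toy model)."""

    def __init__(self, shard_mb):
        self.shard_mb = shard_mb
        self.free_mb = shard_mb

    def try_schedule(self, memory_mb):
        # Schedule only if the activation fits into what is left of the shard.
        if memory_mb > self.free_mb:
            return False
        self.free_mb -= memory_mb
        return True

    def release(self, memory_mb):
        # Called when the activation completes.
        self.free_mb += memory_mb

    def used_mb(self):
        return self.shard_mb - self.free_mb


# Four controllers, each owning a 512 MB shard of a 2048 MB invoker.
books = [ShardBook(512) for _ in range(4)]
accepted = [book.try_schedule(256) for book in books for _ in range(3)]
total_outstanding = sum(book.used_mb() for book in books)
```

In this model each controller's third 256 MB activation is refused, so the total outstanding memory never exceeds the sum of the shards. The bound is on the sum, which is why the arrival order of the Run messages on the invoker should not matter as long as the shards add up to no more than the invoker's memory.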
> 
>     Do you see a considerable number of "Rescheduling Run message" log
>     entries in your environments?
> 
>     [1] https://github.com/apache/incubator-openwhisk/pull/4520
> 
> 
>     Mit freundlichen Grüßen / Regards,
> 
>     Sven Lange-Last
>     Senior Software Engineer
>     IBM Cloud Functions
>     Apache OpenWhisk
> 


