openwhisk-dev mailing list archives

From Tyson Norris <tnor...@adobe.com.INVALID>
Subject Re: Invoker activation queueing proposal
Date Thu, 19 Oct 2017 21:32:07 GMT
More on the queueing proposal:

One problem with a single unified overflow topic is the case of multiple controllers WITH
shared data, where:
- all controllers are in overflow state
- activation A1 arrives at controller C0; a timeout is scheduled and a message is sent to the overflow topic
- controller C1 gets a view of invokers with capacity BEFORE controller C0 does, and begins processing
the overflow topic
- activation A1 is now being processed by C1 (but the request and initial timeout are being
handled on controller C0)

For this case, I am thinking that it should be possible to use a different controller to process
the activation than the controller that originally received the request/activation(!).
To do this (a rough sketch follows this list):
- the timeout scheduled by the overflow-processing controller would account for the time already
spent waiting for invokers (a failed completion message would be sent to the initial controller's
completion topic in case of timeout)
- when the activation is scheduled to an invoker, the original controller (if different) must
be noted in the LoadBalancerData entry
- successful completion messages from the invoker would be sent to the processing controller's
completion topic (and THEN forwarded to the initial controller's completion topic), since both
controllers are waiting on the timeout of that activation's processing in the invoker
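
A rough Scala sketch of what this could look like; all names here (ActivationEntry,
ControllerInstanceId, publishCompletion) are hypothetical stand-ins for illustration only,
not the actual OpenWhisk classes or Kafka producer API:

    import scala.concurrent.duration._

    // Hypothetical stand-ins for the real OpenWhisk types.
    case class ActivationId(asString: String)
    case class ControllerInstanceId(asString: String)

    // Entry kept in LoadBalancerData; originalController is only set when the
    // activation was pulled off the overflow topic by a different controller.
    case class ActivationEntry(
      id: ActivationId,
      timeoutDeadline: Deadline, // original deadline minus time already spent waiting
      originalController: Option[ControllerInstanceId])

    object CompletionForwarding {
      // Hypothetical publisher; in practice this would produce to a Kafka topic.
      def publishCompletion(topic: String, id: ActivationId): Unit =
        println(s"publishing completion for ${id.asString} to $topic")

      def completionTopicFor(c: ControllerInstanceId): String =
        s"completed-${c.asString}"

      // Called on the controller that scheduled the activation to an invoker.
      def onInvokerCompletion(self: ControllerInstanceId, entry: ActivationEntry): Unit = {
        // First resolve locally ...
        publishCompletion(completionTopicFor(self), entry.id)
        // ... then forward to the controller that accepted the original request,
        // so its pending HTTP request / timeout can be resolved as well.
        entry.originalController.filter(_ != self).foreach { orig =>
          publishCompletion(completionTopicFor(orig), entry.id)
        }
      }
    }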

For the case of multiple controllers WITHOUT shared data, I think the same approach works, so
overflow activations would be processed as soon as possible by any available controller and any
available invoker, as opposed to the current behavior, where a controller waits for its own
in-flight activations to complete before its overflow is processed. (The only difference is that
invokers may be overscheduled, because controllers cannot see each other's scheduled activations.)

I’ll try to come up with a diagram to describe this but wanted to mention it to see if people
have feedback on the idea in the meantime. 

Thanks
Tyson

> On Oct 10, 2017, at 10:34 AM, Markus Thömmes <markusthoemmes@me.com> wrote:
> 
> Heyho,
> 
> I ran into the same issue before, and I think our scheduling code should be an Actor. We could
> microbenchmark it to ensure it can happily schedule a large number of actions per second so that
> it does not become a bottleneck.
> 
> +1 for actorizing the LB
> 
> Cheers,
> Markus
> 
> Sent from my iPhone
> 
>> On Oct 10, 2017, at 13:28, Tyson Norris <tnorris@adobe.com.INVALID> wrote:
>> 
>> Hi - 
>> Following up on this, I’ve been working on a PR. 
>> 
>> One issue I’ve run into (which may be problematic in other scheduling scenarios)
is that the scheduling in LoadBalancerService doesn’t respect the new async nature of activation
counting in LoadBalancerData. At least I think this is a good description. 
>> 
>> Specifically, I am creating a test that submits activations via LoadBalancer.publish,
and I end up with 10 activations scheduled on invoker0, even though I use an invokerBusyThreshold
of 1.
>> It would only occur when concurrent requests (or requests arriving within a very short time of
>> each other?) hit the same controller, I think. (Otherwise the counts can sync up quickly enough.)
>> I’ll work more on testing it.
>> 
>> Assuming this (dealing with async counters) is the problem, I'm not exactly sure how to deal
>> with it. Some options may include (a rough sketch of the first option follows this list):
>> - change LoadBalancer to an actor, so that local counter state can be managed more easily
>>   (it would still need to replicate, but at least locally it would do the right thing)
>> - coordinate the schedule + setupActivation calls to also rely on some local state for
>>   activations that should be counted but have not yet been processed within LoadBalancerData
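
As a rough illustration of the first option only: a minimal Akka sketch where all per-invoker
counting goes through a single actor, so concurrent publishes see a serialized view of capacity.
The names and threshold handling below are assumptions for illustration, not the current
LoadBalancer code:

    import akka.actor.{Actor, ActorSystem, Props}

    // Messages for the hypothetical scheduling actor.
    case class TrySchedule(invoker: String)
    case class Completed(invoker: String)

    // All capacity bookkeeping goes through one actor, so two concurrent
    // publishes cannot both observe "0 in flight" and overschedule an invoker.
    class CapacityActor(busyThreshold: Int) extends Actor {
      private var inFlight = Map.empty[String, Int].withDefaultValue(0)

      def receive: Receive = {
        case TrySchedule(invoker) =>
          if (inFlight(invoker) < busyThreshold) {
            inFlight += invoker -> (inFlight(invoker) + 1)
            sender() ! true  // caller may send the activation to this invoker
          } else {
            sender() ! false // caller should try another invoker or overflow
          }
        case Completed(invoker) =>
          inFlight += invoker -> math.max(0, inFlight(invoker) - 1)
      }
    }

    object CapacityActorDemo extends App {
      val system = ActorSystem("lb")
      val capacity = system.actorOf(Props(new CapacityActor(busyThreshold = 1)))
      capacity ! TrySchedule("invoker0") // replies are ignored in this demo
      system.terminate()
    }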
>> 
>> Any suggestions in this area would be great.
>> 
>> Thanks
>> Tyson
>> 
>> 
>> 
>>> On Oct 6, 2017, at 11:04 AM, Tyson Norris <tnorris@adobe.com.INVALID> wrote:
>>> 
>>> With many invokers, there is less data exposed to rebalancing operations, since the invoker
>>> topics will only ever receive as many activations as can be processed “immediately”, currently
>>> capped at 16. The single backlog topic would only be consumed by the controller (not by any
>>> invoker), and the invokers would only consume their respective “process immediately” topic,
>>> which effectively has no, or very little, backlog (16 max). My suggestion is that having
>>> multiple backlogs is an unnecessary problem, regardless of how many invokers there are.
>>> 
>>> It is worth noting the case of multiple controllers as well, where multiple controllers
may be processing the same backlog topic. I don’t think this should cause any more trouble
than the distributed activation counting that should be enabled via controller clustering,
but it may mean that if one controller enters overflow state, it should signal that ALL controllers
are now in overflow state, etc.
>>> 
>>> Regarding “timeout”, I would plan to use the existing timeout mechanism, where an
>>> ActivationEntry is created immediately, regardless of whether the activation is going to be
>>> processed right away or added to the backlog. At the time the backlog message is processed,
>>> if the entry has timed out, throw it away. (The entry map may need to be shared in the case
>>> where multiple controllers are in use and they all consume from the same topic; alternatively,
>>> we can partition the topic so that entries are only processed by the controller that
>>> backlogged them.)
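
A minimal sketch of the check-at-backlog-processing-time idea, with hypothetical names (the real
ActivationEntry and timeout plumbing differ):

    import java.time.Instant

    // Hypothetical entry created when the activation was first received.
    case class BacklogEntry(activationId: String, expires: Instant)

    object BacklogConsumer {
      // Called for each message pulled from the overflow/backlog topic.
      // Returns Some(entry) if it should still be scheduled, or None if the
      // original request has already timed out and the entry is discarded.
      def process(entry: BacklogEntry, now: Instant = Instant.now()): Option[BacklogEntry] =
        if (now.isAfter(entry.expires)) {
          // The waiting HTTP request has already been answered with a timeout;
          // scheduling it now would only waste invoker capacity.
          None
        } else {
          Some(entry)
        }
    }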
>>> 
>>> Yes, once invokers are saturated and backlogging begins, I think all incoming activations
>>> should be sent straight to the backlog (we already know that no invokers are available). This
>>> should not hurt overall performance any more than it currently does, and should be better,
>>> since the first available invoker can start taking work instead of waiting on a specific
>>> invoker to become available.
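
A small sketch of what that "straight to backlog while saturated" decision could look like; the
flag and names below are illustrative assumptions, not existing code:

    // Hypothetical controller-local view of the overflow state: once any
    // activation has been sent to the overflow topic, new arrivals go straight
    // there too, until the backlog has drained.
    object OverflowState {
      @volatile private var overflowing = false

      def enterOverflow(): Unit = overflowing = true
      def exitOverflow(): Unit  = overflowing = false // e.g. when the backlog is empty again

      // Routing decision made at publish time: skip invoker selection entirely
      // while overflowing, since we already know no invoker has capacity.
      def routeToOverflow(invokerHasCapacity: => Boolean): Boolean =
        overflowing || !invokerHasCapacity
    }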
>>> 
>>> I’m working on a PR; I think many of these details will come out there, but in the meantime,
>>> let me know if any of this doesn't make sense.
>>> 
>>> Thanks
>>> Tyson
>>> 
>>> 
>>> On Oct 5, 2017, at 2:49 PM, David P Grove <groved@us.ibm.com> wrote:
>>> 
>>> 
>>> I can see the value in delaying the binding of activations to invokers when the
system is loaded (can't execute "immediately" on its target invoker).
>>> 
>>> Perhaps in ignorance, I am a little worried about the scalability of a single
backlog topic. With a few hundred invokers, it seems like we'd be exposed to frequent and
expensive partition rebalancing operations as invokers crash/restart. Maybe if we have N =
K*M invokers, we can get away with M backlog topics each being read by K invokers. We could
still get imbalance across the different backlog topics, but it might be good enough.
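
A tiny sketch of the N = K*M assignment, purely illustrative:

    // Sketch of the N = K * M idea: N invokers hashed onto M backlog topics,
    // so each topic is read by roughly K = N / M invokers. Names are illustrative.
    object BacklogSharding {
      def topicFor(invokerIndex: Int, numBacklogTopics: Int): String =
        s"backlog-${invokerIndex % numBacklogTopics}"
    }

    // e.g. with N = 300 invokers and M = 10 topics, invoker 42 reads "backlog-2"
    // and each topic is shared by ~30 invokers; a crash/restart then only triggers
    // a rebalance among those ~30 consumers rather than all 300.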
>>> 
>>> I think we'd also need to do some thinking about how to ensure that work put in a backlog
>>> topic doesn't languish there for a really long time. Once we start having work in the backlog,
>>> do we need to stop putting work in the “immediately” topics? If we do, that could hurt overall
>>> performance. If we don't, how will the backlog topic ever get drained if most invokers are
>>> kept busy servicing their “immediately” topics?
>>> 
>>> --dave
>>> 
>>> 
>>> From:  Tyson Norris <tnorris@adobe.com.INVALID>
>>> To:  "dev@openwhisk.apache.org" <dev@openwhisk.apache.org>
>>> Date:  10/04/2017 07:45 PM
>>> Subject:  Invoker activation queueing proposal
>>> 
>>> ________________________________
>>> 
>>> 
>>> 
>>> Hi -
>>> 
>>> I’ve been discussing a bit with a few people about optimizing the queueing that goes on
>>> ahead of invokers, so that things behave more simply and predictably.
>>> 
>>> 
>>> 
>>> In short: Instead of scheduling activations to an invoker on receipt, do the
following:
>>> 
>>> - execute the activation "immediately" if capacity is available
>>> 
>>> - provide a single overflow topic for activations that cannot execute “immediately"
>>> 
>>> - schedule from the overflow topic when capacity is available
>>> 
>>> 
>>> 
>>> (BTW “Immediately” means: the activation is still queued via the existing invoker topics, but
>>> it ONLY gets queued there when the invoker is not fully loaded and can therefore execute it
>>> “very soon”.)
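
A minimal sketch of the publish-side routing described above, with hypothetical names (not the
actual LoadBalancer API):

    // Hypothetical routing for the proposal: the activation either goes to a
    // specific invoker's topic (only if that invoker has free capacity, so it
    // will run "very soon") or to the single shared overflow topic.
    object ActivationRouter {
      sealed trait Destination
      case class InvokerTopic(invoker: String) extends Destination
      case object OverflowTopic extends Destination

      def route(invokersWithCapacity: Seq[String]): Destination =
        invokersWithCapacity.headOption match {
          case Some(invoker) => InvokerTopic(invoker) // execute "immediately"
          case None          => OverflowTopic         // schedule later, when capacity frees up
        }
    }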
>>> 
>>> 
>>> 
>>> Later: it would also be good to provide more container state data from the invoker to the
>>> controller, to get better scheduling options - e.g. if some invokers can handle running more
>>> containers than others, that info can be used to avoid over- or under-loading the invokers
>>> (currently we assume each invoker can handle 16 activations, I think).
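
A sketch of what such an invoker capacity report might carry; this is purely illustrative, not an
existing message type:

    // Illustrative capacity report an invoker could send with its health pings,
    // letting the controller size each invoker's "immediate" queue individually
    // instead of assuming a fixed 16 slots everywhere.
    case class InvokerCapacity(
      invoker: String,
      maxConcurrentContainers: Int, // what this host can actually run
      busyContainers: Int) {        // currently occupied slots

      def freeSlots: Int = math.max(0, maxConcurrentContainers - busyContainers)
    }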
>>> 
>>> 
>>> 
>>> I put a wiki page proposal here: https://cwiki.apache.org/confluence/display/OPENWHISK/Invoker+Activation+Queueing+Change
>>> 
>>> 
>>> 
>>> WDYT?
>>> 
>>> 
>>> 
>>> Thanks
>>> 
>>> Tyson
>>> 
>> 
