openwhisk-dev mailing list archives

From TzuChiao Yeh <su3g4284zo...@gmail.com>
Subject Re: New scheduling algorithm proposal.
Date Sun, 24 Jun 2018 09:24:10 GMT
Hi Dominic and Whiskers,



Sorry for the late reply; I’ve been swamped by school finals for the past
two weeks...



First of all, I still think your proposal is great. Before continuing, I
think we can model these ideas in a more well-formed structure. Let me recap
OpenWhisk (and, more generally, serverless) workloads and my opinions.



Performance characteristics of OpenWhisk (Serverless):



Serverless application workloads are inherently short-running by default,
although there is no hard restriction on that. Two time factors dominate an
invocation: activation processing and container operations.



To clarify my definitions:

Activation processing time = in-container processing time (after init).

Container operations time = container creation/deletion/resume/pause time.

Cold start time = container operations time + init time.



Assume there is only one slot:

Turnaround time = Σ_i [ ColdStart(i) × cold start time(i) + activation
processing time(i) ]

where ColdStart(i) is 1 when activation i cannot reuse a warm container and
0 otherwise.
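As a toy illustration of this formula (a sketch with hypothetical numbers; the `cold_start` flag marks activations that cannot reuse a warm container):

```python
# Toy model of the single-slot turnaround-time formula above.
# Each activation contributes its processing time, plus a cold-start
# penalty when no warm container can be reused.

def turnaround_time(activations):
    """activations: list of (cold_start: bool, cold_start_time, processing_time)."""
    return sum(
        (cold_start_time if cold_start else 0.0) + processing_time
        for cold_start, cold_start_time, processing_time in activations
    )

# Three activations: one cold start (500 ms), then two warm reuses.
trace = [(True, 0.5, 0.02), (False, 0.5, 0.02), (False, 0.5, 0.02)]
```

With, say, a 500 ms cold start and 20 ms of processing per activation, the cold start dominates the first invocation, and warm reuse keeps the remaining invocations cheap.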



First, there are only two possibilities for an incoming activation: reuse a
warm container or wait. The performance of activation processing is
therefore passively determined by container operations. Conversely, how
many activations arrive, and how heavy they are, should drive the container
operation decisions.


Note that the weight of an activation is currently not deterministic, but
in the future there might be a mechanism to estimate how long an activation
will take to process.



Second, container operations. To avoid complexity here, simply divide them
into creation and deletion:

1.    Container creation: needs to consider the available resources and how
many/how heavy the existing activations are to find a (near-)optimal
placement.

2.    Container deletion: likewise needs to consider the resources and the
number/weight of existing activations.



Therefore, some dimensions we can model here:

1.    Resources: CPU, Memory, etc.

2.    Action runtime type.

3.    Total weight of activations (number of queuing activations * weight
of activation).

4.    …
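A minimal sketch of how dimension 3 could be computed per action (the action names and weights below are purely illustrative, not from OpenWhisk):

```python
# Sketch: total weight of queued activations per action (dimension 3).
# The per-activation weight stands in for the expected processing cost,
# which, as noted above, is not deterministic today.

def total_weight(queued_activations, weight_per_activation):
    return queued_activations * weight_per_activation

queue = {
    "thumbnail": {"queued": 40, "weight": 1.0},   # many light activations
    "transcode": {"queued": 5,  "weight": 20.0},  # few heavy activations
}

# Rank actions by total weight to decide where new containers help most.
ranked = sorted(
    queue,
    key=lambda a: total_weight(queue[a]["queued"], queue[a]["weight"]),
    reverse=True,
)
```

Ranking by total weight rather than raw queue length is what distinguishes this from the current scheduling logic: a short queue of heavy activations can outweigh a long queue of light ones.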



Simply put, the current OpenWhisk load balancer suffers from having no
notion of the total weight of activations in its scheduling logic. Dominic
already pointed out one possibility here: checking Kafka consumer lag can
model the number of queuing activations. But the problem remains that no
other dimensions are considered in scheduling.
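The lag idea can be sketched abstractly (no real Kafka client here; the offsets and threshold are hypothetical):

```python
# Sketch: Kafka consumer lag as an estimate of queued activations.
# Lag = latest produced offset minus the consumer's committed offset.

def consumer_lag(end_offset, committed_offset):
    return max(0, end_offset - committed_offset)

def needs_more_containers(end_offset, committed_offset, threshold):
    """Heuristic: create a container when the backlog exceeds a threshold."""
    return consumer_lag(end_offset, committed_offset) > threshold
```

In a real deployment the two offsets would come from the broker and the consumer group, so this estimate is only as fresh as the last offset commit.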


Some issues we need to clarify:

1.    As Dominic noted, invoker-side state can vary within a very short
period, and the interval at which invokers report their state to the
controller is not short enough to track it.

2.    Scheduler overhead in a large-scale cluster.



In my opinion:

1.    We can emit timestamped events for every container operation and
every activation processing step (dynamic), combined with invoker health
checks (static).

2.    We should explore the possibility of a standalone scheduler.



Whether or not the scheduler is standalone, the main idea is to make the
data plane cleaner. The mechanism is quite simple and similar to Tyson's
**overflow topic** proposal, using additional topic(s):

1.    Dynamic topic checking, (partition) creation, and deletion for each
action.

2.    Use the cold-start bus to drive scheduling when no topic exists (a
cache miss): the scheduler dynamically creates the action topic and
associates containers with it.

3.    Drop a topic if it has not been used for a long time.
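One way to picture steps 1-3 is a per-action topic cache with idle-time eviction (a sketch; the class name, routing labels, and 10-minute TTL are assumptions, not part of the proposal):

```python
import time

# Sketch of the per-action topic lifecycle described above:
# cache hit -> route directly to the action topic; cache miss -> the
# cold-start path drives scheduling and creates the topic; topics idle
# longer than the TTL are dropped.

class ActionTopicCache:
    def __init__(self, idle_ttl=600.0, clock=time.monotonic):
        self.idle_ttl = idle_ttl
        self.clock = clock
        self.last_used = {}  # action name -> last-use timestamp

    def route(self, action):
        hit = action in self.last_used
        self.last_used[action] = self.clock()
        # A miss is where the scheduler would create the topic and
        # associate containers with it.
        return "cache-hit" if hit else "cold-start"

    def evict_idle(self):
        now = self.clock()
        for action in [a for a, t in self.last_used.items()
                       if now - t > self.idle_ttl]:
            del self.last_used[action]  # drop the long-unused topic
```

This is also where the masking argument below applies: on a miss the container cold start is already underway, so the extra round trip to the scheduler hides behind it.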


Here's the conceptual diagram:

https://drive.google.com/file/d/1NPuB7r9fxZhMNkLGClZYbdCfDKCxY9pi/view?usp=sharing


More details need to be clarified and the idea may not be mature, but the
advantages of this mechanism might be:

1.    Latency is still minimized on a cache hit; and once a cold start
occurs, the container operation latency masks the round-trip latency to
the scheduler.

2.    It can host a more compute-intensive executor for more complex
scheduling (e.g., graph-based optimization).

3.    Persistence in Kafka is still guaranteed when an activation goes
through a cold start.

4.    It opens up scheduling possibilities for large-scale clusters.


I think container scheduling in serverless might be closer to the various
scheduling algorithms found in cluster management and container
orchestration systems, such as Mesos and Kubernetes, than to a network load
balancer.


For example, a great article from the *Firmament* project [1] classifies
schedulers into these five models; here is a short description of each:

1.    Monolithic scheduler: a single scheduler process runs on one machine
and assigns tasks to machines.

2.    Two-level scheduler: separates the concerns of resource allocation
and task placement. Workload-specific schedulers interact with a resource
manager that carves out dynamic partitions of the cluster resources for
each workload.

3.    Shared-state scheduler: a “semi-distributed” model; multiple replicas
of cluster state are independently updated by application-level schedulers.

4.    Fully-distributed scheduler: no coordination between schedulers at
all; incoming workloads are serviced by independent schedulers.

5.    Hybrid: combines a distributed architecture with monolithic or
shared-state designs.



I’m still studying some academic references and mapping them to the
OpenWhisk and serverless architecture. I have no experience with
large-scale distributed cluster scheduling (and no resources for it,
either), so I’m just trying to sort out my thoughts; please let me know if
my understanding is wrong. Some folks here already have deep experience and
research in large-scale distributed cluster scheduling, and it would be
nice to discuss these topics with them further.


Thanks!

TzuChiao


[1] http://firmament.io/blog/scheduler-architectures.html




On Sat, Jun 9, 2018 at 12:24 AM, Dominic Kim <style9595@gmail.com> wrote:

> Hi TzuChiao
>
> Those are great and fair comments!
>
> 1. Kafka utilization.
>
> Regarding what is written in link[1], it's correct.
> More partitions lead to higher throughput and latency.
>
> But the number of partitions in one Kafka node is limited to some level in
> my proposal.
> It does not increase infinitely as we would add more servers to support
> more concurrent actions.
> As I shared in my previous email, all partitions(no matter of topic) would
> be evenly distributed among nodes.
> So overhead from multiple partitions can be limited in one node, and
> topic-wise overhead can be distributed among multiple nodes.
>
> With regard to your question on batch processing, yes, current MessageFeed
> fetches messages in batch.
> But inherently activation processing can't be done in batch.
> Even though it fetches a bunch of messages, invoker should(and does) handle
> activation in serial order(concurrently).
> Because if they commit offsets in batch, there is a possibility to invoke
> actions multiple times in case of failure.
> In turn, committing offset is done one by one.
>
> If you are only mentioning about fetching multiple messages, that can be
> achieved in my proposal as well because consumers are dedicated to a given
> topic and it's safe to fetch multiple messages.
> (However, committing offset still works in serial)
>
> Each consumers in a same group will be assigned one partition respectively.
> And they can fetch as many messages as they want if they commit offsets one
> by one.
>
> 2. Resource-aware scheduling.
>
> Actually, I think optimal scheduling based on real resource usage is not
> feasible in a current architecture.
> This is because real resource resides in invoker side, but scheduling
> decision is made by controllers.
> And resource status can change in 5ms ~ 10ms as the execution time of
> action can be very less.
> (I observed 2ms execution time for some actions.)
>
> So all invokers should share their extremely frequently changing status to
> all controllers,
> and a controller should schedule activations to optimal invokers along with
> considering other controllers.
> (Because there is a possibility that other controllers can also schedule
> activations to same the invoker.)
> And all these procedures should be done within lesser than 5ms.
>
> So I think there will always be some gap between real resource usages and
> status kept by controllers.
> And this is the reason why I made a proposal which utilizes an asynchronous
> way.
>
> Regarding sharding concept at loadbalancer, I just refer to it to minimize
> the intervention among controllers when scheduling.
> Since scheduling for container creation and activations are segregated, and
> container creation process can work in an asynchronous way, I think that
> would be enough.
> If there is a better way to handle this, that would be better.
>
> Finally, regarding the word, autonomous, I just named it because a
> container itself can fetch and handle activation messages without any
> intervention of invokers : )
>
> Anyway, thank you for very valuable comments.
> I hope this would help.
>
> Best regards
> Dominic.
>
>
>
>
> 2018-06-07 21:55 GMT+09:00 TzuChiao Yeh <su3g4284zo6y7@gmail.com>:
>
> > Hi Dominic,
> >
> > I really like your proposal! Thanks for your awesome presentation and
> > materials, help me a lot.
> >
> > I have some opinions and questions here about the proposal and previous
> > discussions:
> >
> > 1. About kafka utilization:
> >
> > First of all, bypassing invokers is a great idea, though this will lead
> to
> > lots of hard work on various runtime. The possibility on utilizing "hot"
> > containers and parallelism also looks well.
> >
> > I'm not a kafka expert at all, but there are some external references
> > talking about large number of partitions. [1]
> >
> > One question here -
> >
> > From my own investigation, current implementation on messaging (consumer)
> > uses batch (offset) to enhance performance. Since your proposed algorithm
> > assign each topic to an action, more accurately speaking, each partition
> to
> > a running container. It might drop the capability on using batch and I'm
> > not sure that how much overhead will take? Or there might be some
> advanced
> > design for balancing number of partition and offset?
> >
> > 2. Controller side:
> >
> > I pay more interests on the future enhancement for extending more states
> > (from invokers) for controllers. I.e. the more accurate resource-aware
> > scheduling (i.e. memory-aware that Markus proposed in previous dev list),
> > package aware scheduling for package caching [2] and control the running
> > container's reserved time that you've mentioned ---> warm/hot containers
> > utilization.
> >
> > At the first time, I'm confused of the "autonomous" word maps to your
> > algorithm. I think the scheduling logic still sit in controller side (for
> > checking lags, limits and so on).
> > Seems like the proposed algorithm will not drop out the current sharding
> > loadbalancer, can you share more experience or plans on integrating these
> > in controller side?
> >
> > [1]
> > https://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/
> > [2] https://dl.acm.org/citation.cfm?id=3186294&dl=ACM&coll=DL
> >
> > Thanks,
> > TzuChiao
> >
> > On Thu, Jun 7, 2018 at 1:08 AM, Dominic Kim <style9595@gmail.com> wrote:
> >
> > > Sorry.
> > > Let me share Kafka benchmark results again.
> > >
> > > | # of topics  |   Kafka TPS |
> > > |    50  |   34,488 |
> > > |  100  |  34,502 |
> > > |  200  |   31,781 |
> > > |  500  |   30,324 |
> > > | 1000  |  30,855 |
> > >
> > > Best regards
> > > Dominic
> > >
> > >
> > > 2018-06-07 2:04 GMT+09:00 Dominic Kim <style9595@gmail.com>:
> > >
> > > > Sorry for the late response Tyson.
> > > >
> > > > Let me first answer your second question.
> > > > Vuser is just the number of threads to send the requests.
> > > > Each Vusers randomly picked the namespace and send the request using
> > REST
> > > > API.
> > > > So they are independent of the number of namespaces.
> > > >
> > > > And regarding performance degradation on the number of users, I think
> > it
> > > > works a little bit differently.
> > > > Even though I have only 2 users(namespaces), if their homeInvoker is
> > > same,
> > > > TPS become very less.
> > > > So it is a matter of the number of actions whose homeInvoker are same
> > > > though more the number of users than the number of containers can
> harm
> > > the
> > > > performance.
> > > > This is because controller should send those actions to the same
> > invoker
> > > > even though there are other idle invokers.
> > > > In my proposal, controllers can schedule activation to any invokers
> so
> > it
> > > > does not happen.
> > > >
> > > >
> > > > And regarding the issue about the sheer number of Kafka topics, let
> me
> > > > share my idea.
> > > >
> > > > 1. Data size is not changed.
> > > >
> > > > If we have 1000 activation requests, they will be spread among
> invoker
> > > > topics. Let's say we have 10 invokers, then ideally each topic will
> > have
> > > > 100 messages.
> > > > In my proposal, if I have 10 actions, each topic will have 100
> messages
> > > as
> > > > well.
> > > > Surely there will be more number of actions than the number of
> > invokers,
> > > > data will be spread to more topics, but data size is unchanged.
> > > >
> > > > 2. Data size depends on the number of active actions.
> > > >
> > > > For example, if we have one million actions, in turn, one million
> > topics
> > > > in Kafka.
> > > > If only half of them are executed, then there will be data only for
> > half
> > > > of them.
> > > > For rest half of topics, there will be no data and they won't affect
> > the
> > > > performance.
> > > >
> > > > 3. Things to concern.
> > > >
> > > > Let me describe what happens if there are more number of Kafka
> topics.
> > > >
> > > > Let's say there are 3 invokers with 5 activations each in the current
> > > > implementation, then it would look like this.
> > > >
> > > > invoker0: 0 1 2 3 4 5 (5 messages) -> consumer0
> > > > invoker1: 0 1 2 3 4 5 -> consumer1
> > > > invoker2: 0 1 2 3 4 5 -> consumer2
> > > >
> > > > Now If I have 15 actions with 15 topics in my proposal.
> > > >
> > > > action0: 0  -> consumer0
> > > > action1: 0  -> consumer1
> > > > action2: 0  -> consumer2
> > > > action3: 0  -> consumer3
> > > > .
> > > > .
> > > > .
> > > > action14: 0  -> consumer14
> > > >
> > > > Kafka utilizes page cache to maximize the performance.
> > > > Since the size of data is not changed, data kept in page cache is
> also
> > > not
> > > > changed.
> > > > But the number of parallel access to data is increased. I think it
> > might
> > > > be some overhead.
> > > >
> > > > That's the reason why I performed benchmark with multiple topics.
> > > >
> > > > | # of topics  |   Kafka TPS |
> > > > |      50  |  34,488 |
> > > > |    100  |  34,502 |
> > > > |    200  |  31,781 |
> > > > |    500  |  30,324 |
> > > > | 1,000  |  30,855 |
> > > >
> > > > As you can see there are some overheads from increased parallel
> access
> > to
> > > > data.
> > > > Here we can see about 4,000 TPS degraded as the number of topics
> > > increased.
> > > >
> > > > But we can still support 30K TPS with 1000 topics using 3 Kafka
> nodes.
> > > > If we need more TPS we can just add more nodes.
> > > > Since Kafka can horizontally scale-out, if we add 3 more servers, I
> > > expect
> > > > we can get 60K TPS.
> > > >
> > > > Partitions in Kafka are evenly distributed among nodes in a cluster.
> > > > Each nodes becomes a leader for each partition. If we have 100
> > partitions
> > > > with 4 Kafka nodes, ideally each Kafka nodes will be a leader for 25
> > > > partitions.
> > > > Then consumers can directly read messages from different partition
> > > leaders.
> > > > This is why Kafka can horizontally scale-out.
> > > >
> > > > Even though the number of topics is increased, if we add more Kafka
> > > nodes,
> > > > the number of partitions which is managed by one Kafka node would be
> > > > unchanged.
> > > > So if we can support 30K TPS with 1000 topics using 3 nodes, then we
> > can
> > > > still get 60K TPS with 2000 topics using 6 nodes.
> > > > Similarly, if we have enough Kafka nodes, the number of partitions in
> > one
> > > > Kafka nodes will be same though we have one million concurrent
> > > invocations.
> > > >
> > > > This is what I am thinking.
> > > > If I miss anything, kindly let me know.
> > > >
> > > > Thanks
> > > > Regards
> > > > Dominic.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > 2018-05-26 13:22 GMT+09:00 Tyson Norris <tnorris@adobe.com.invalid>:
> > > >
> > > >> Hi Dominic -
> > > >>
> > > >> Thanks for the detailed presentation! It is helpful to go through to
> > > >> understand your points - well done.
> > > >>
> > > >>
> > > >> A couple of comments:
> > > >>
> > > >> - I'm not sure how unbounded topic (and partition) growth will be
> > > >> handled, realistically. AFAIK, these are not free of resource
> > > consumption
> > > >> at the client or.
> > > >>
> > > >> - In your test it looks like you have 990 vusers example (pdf page
> > 121)
> > > -
> > > >> are these using different namespaces? I ask because I think the
> > current
> > > >> impl isolates the container per namespace, so if you are limited to
> > 180
> > > >> containers, I can see how there will be container removal/restarts
> in
> > > the
> > > >> case where the number of users greatly outnumbers the number of
> > > containers
> > > >> - I'm not sure if the test behaves this way, or your "New
> > > implementation"
> > > >> behaves similar? (does a container get reused for many different
> > > >> namespaces?)
> > > >>
> > > >>
> > > >> I'm interested to know if there are any kafka experts here that can
> > > >> provide more comments on the topics/partition handling question? I
> > will
> > > >> also ask for some additional feedback from other colleagues.
> > > >>
> > > >>
> > > >> I will gather some more comments to share, but wanted to start off
> > with
> > > >> these. Will continue next week after the long (US holiday) weekend.
> > > >>
> > > >>
> > > >> Thanks
> > > >>
> > > >> Tyson
> > > >>
> > > >>
> > > >> ________________________________
> > > >> From: Dominic Kim <style9595@gmail.com>
> > > >> Sent: Friday, May 25, 2018 1:58:55 AM
> > > >> To: dev@openwhisk.apache.org
> > > >> Subject: Re: New scheduling algorithm proposal.
> > > >>
> > > >> Dear Whiskers.
> > > >>
> > > >> I uploaded the material that I used to give a speech at last
> biweekly
> > > >> meeting.
> > > >>
> > > >> https://cwiki.apache.org/confluence/display/OPENWHISK/Autonomous+Container+Scheduling
> > > >>
> > > >> This document mainly describes following things:
> > > >>
> > > >> 1. Current implementation details
> > > >> 2. Potential issues with current implementation
> > > >> 3. New scheduling algorithm proposal: Autonomous Container
> Scheduling
> > > >> 4. Review previous issues with new scheduling algorithm
> > > >> 5. Pros and cons of new scheduling algorithm
> > > >> 6. Performance evaluation with prototyping
> > > >>
> > > >>
> > > >> I hope this is helpful for many Whiskers to understand the issues
> and
> > > new
> > > >> proposal.
> > > >>
> > > >> Thanks
> > > >> Best regards
> > > >> Dominic
> > > >>
> > > >>
> > > >> 2018-05-18 18:44 GMT+09:00 Dominic Kim <style9595@gmail.com>:
> > > >>
> > > >> > Hi Tyson
> > > >> > Thank you for comments.
> > > >> >
> > > >> > First, total data size in Kafka would not be changed, since the
> > number
> > > >> of
> > > >> > activation requests will be same though we activate more topics.
> > > >> > Same amount of data will be just split into different topics, so
> > there
> > > >> > would be no need for more disk space in Kafka.
> > > >> > But it will increase parallel processing at Kafka layer, more
> number
> > > of
> > > >> > consumers will access to Kafka at the same time.
> > > >> >
> > > >> > The reason why active/inactive is meaningful in the scene is, the
> > size
> > > >> of
> > > >> > parallel processing will be dependent on it.
> > > >> > Though we have 100k and more topics(actions), if only 10 actions
> are
> > > >> being
> > > >> > invoked, there will be only 10 parallel processing.
> > > >> > The number of inactive topics does not affect the performance of
> > > active
> > > >> > consumers.
> > > >> > If we really need to support 100k parallel action invocation,
> surely
> > > we
> > > >> > need more nodes to handle them not only for just Kafka.
> > > >> > Kafka can horizontally scale out and number of active topics at
> some
> > > >> point
> > > >> > will always be lesser than the total number of topics,  Based on
> my
> > > >> > benchmark results, I expect it is enough to take with scale-out of
> > > >> Kafka.
> > > >> >
> > > >> > Regarding topic cleanup, we don't need to clean them up by
> > ourselves,
> > > >> > Kafka will clean up expired data based on retention configuration.
> > > >> > So if the topic is no more activated(the action is no more
> invoked),
> > > >> there
> > > >> > would be no actual data though the topic exists.
> > > >> > And as I said, even if data exists for some topics, they won't
> > affect
> > > >> > other active consumers if they are not activated.
> > > >> >
> > > >> > Regarding concurrent activation PR, it is very worthy change. And
> I
> > > >> > recognize it is orthogonal to my PR.
> > > >> > It will not only alleviate current performance issue but can be
> used
> > > >> with
> > > >> > my changes as well.
> > > >> >
> > > >> > In current logics, controller schedule activations based on hash,
> > your
> > > >> PR
> > > >> > would be much effective if some changes are made on scheduling
> > logic.
> > > >> >
> > > >> > Regarding bypassing kafka, I am inclined to use Kafka because it
> can
> > > act
> > > >> > as a kind of buffer.
> > > >> > If some activations are not processed due to some reasons such as
> > lack
> > > >> of
> > > >> > resources or invoker failure and so on, Kafka can keep them up for
> > > some
> > > >> > times and guarantee `at most once` invocation though invocation
> > might
> > > >> be a
> > > >> > bit delayed.
> > > >> >
> > > >> > With regard to combined approach, I think that is great idea.
> > > >> > For that, container states should be shared among invokers and
> they
> > > can
> > > >> > send activation request to any containers(local, or remote).
> > > >> > As so invokers will utilize warmed resources which are not located
> > in
> > > >> its
> > > >> > local.
> > > >> >
> > > >> > But it will also introduce some synchronization issue among
> > > controllers
> > > >> > and invokers or it needs segregation between resources based
> > > scheduling
> > > >> at
> > > >> > controller and real invocation.
> > > >> > In the earlier case, since controller will schedule activations
> > based
> > > on
> > > >> > resources status, it is required to synchronize them in realtime.
> > > >> > Invokers can send requests to any remote containers, there will be
> > > >> > mismatch in resource status between controllers and invokers.
> > > >> >
> > > >> > In the later case, controller should be able to send requests to
> any
> > > >> > invokers then invoker will schedule the activations.
> > > >> > In this case also, invokers need to synchronize their container
> > status
> > > >> > among them.
> > > >> >
> > > >> > Under the situation all invokers have same resources status, if
> two
> > > >> > invokers received same action invocation requests, it's not easy
> to
> > > >> control
> > > >> > the traffic among them, because they will schedule requests to
> same
> > > >> > containers. And if we take similar approach with what you
> suggested,
> > > to
> > > >> > send intent to use the containers first, it will introduce
> > increasing
> > > >> > latency overhead as more and more invokers joined the cluster.
> > > >> > I couldn't find any good way to handle this yet. And this is why I
> > > >> > proposed autonomous containerProxy to enable location free
> > scheduling.
> > > >> >
> > > >> > Finally regarding SPI, yes you are correct, ContainerProxy is
> highly
> > > >> > dependent on ContainerPool, I will update my PR as you guided.
> > > >> >
> > > >> > Thanks
> > > >> > Regards
> > > >> > Dominic.
> > > >> >
> > > >> >
> > > >> > 2018-05-18 2:22 GMT+09:00 Tyson Norris <tnorris@adobe.com.invalid
> >:
> > > >> >
> > > >> >> Hi Dominic -
> > > >> >>
> > > >> >> I share similar concerns about an unbounded number of topics,
> > despite
> > > >> >> testing with 10k topics. I’m not sure a topic being considered
> > active
> > > >> vs
> > > >> >> inactive makes a difference from broker/consumer perspective? I
> > think
> > > >> there
> > > >> >> would minimally have to be some topic cleanup that happens, and
> I’m
> > > not
> > > >> >> sure the impact of deleting topics in bulk will have on the
> system
> > > >> either.
> > > >> >>
> > > >> >> A couple of tangent notes related to container reuse to improve
> > > >> >> performance:
> > > >> >> - I’m putting together the concurrent activation PR[1] (to allow
> > > reuse
> > > >> of
> > > >> >> an warm container for multiple activations concurrently); this
> can
> > > >> improve
> > > >> >> throughput for those actions that can tolerate it (FYI per-action
> > > >> config is
> > > >> >> not implemented yet). It suffers from similar inaccuracy of kafka
> > > >> message
> > > >> >> ingestion at invoker “how many messages should I read”? But I
> think
> > > we
> > > >> can
> > > >> >> tune this a bit by adding some intelligence to
> Invoker/MessageFeed
> > > >> like “if
> > > >> >> I never see ContainerPool indicate it is busy, read more next
> > time” -
> > > >> that
> > > >> >> is, allow ContainerPool to backpressure MessageFeed based on
> > ability
> > > to
> > > >> >> consume, and not (as today) strictly on consume+process.
> > > >> >>
> > > >> >> - Another variant we are investigating is putting a ContainerPool
> > > into
> > > >> >> Controller. This will prevent container reuse across controllers
> > > >> (bad!),
> > > >> >> but will bypass kafka(good!). I think this will be plausible for
> > > >> actions
> > > >> >> that support concurrency, and may be useful for anything that
> runs
> > as
> > > >> >> blocking to improve a few ms of latency, but I’m not sure of all
> > the
> > > >> >> ramifications yet.
> > > >> >>
> > > >> >>
> > > >> >> Another (more far out) approach combines some of these is
> changing
> > > the
> > > >> >> “scheduling” concept to be more resource reservation and garbage
> > > >> >> collection. Specifically that the ContainerPool could be a
> > > combination
> > > >> of
> > > >> >> self-managed resources AND remote managed resources. If no proper
> > > >> (warm)
> > > >> >> container exists locally or remotely, a self-managed one is
> > created,
> > > >> and
> > > >> >> advertised. Other ContainerPool instances can leverage the remote
> > > >> resources
> > > >> >> (containers). To pause or remove a container requires advertising
> > > >> intent to
> > > >> >> change state, and giving clients time to veto. So there is some
> > added
> > > >> >> complication in the start/reserver/pause/rm container lifecycle,
> > but
> > > >> the
> > > >> >> case for reuse is maximized in best case scenario (concurrency
> > > tolerant
> > > >> >> actions) and concurrency intolerant actions have a chance to
> > > leverage a
> > > >> >> broader pool of containers (iff the ability to reserve a shared
> > > >> available
> > > >> >> container is fast enough, compared to starting a new cold one).
> > There
> > > >> is a
> > > >> >> lot wrapped in there (how are resources advertised, what are the
> > new
> > > >> states
> > > >> >> of lifecycle, etc), so take this idea with a grain of salt.
> > > >> >>
> > > >> >>
> > > >> >> Specific to your PR: do you need an SPI for ContainerProxy? Or
> can
> > it
> > > >> >> just be designated by the ContainerPool impl to use a specific
> > > >> >> ContainerProxy variant? I think these are now and will continue
> to
> > be
> > > >> tied
> > > >> >> closely together, so would manage them as a single SPI.
> > > >> >>
> > > >> >> Thanks
> > > >> >> Tyson
> > > >> >>
> > > >> >> [1] https://github.com/apache/incubator-openwhisk/pull/2795#pullrequestreview-115830270
> > > >> >>
> > > >> >> On May 16, 2018, at 10:42 PM, Dominic Kim <style9595@gmail.com> wrote:
> > > >> >>
> > > >> >> Dear all.
> > > >> >>
> > > >> >> Does anyone have any comments on this?
> > > >> >> Any comments or opinions would be greatly welcome : )
> > > >> >>
> > > >> >> I think we need around following changes to take this in.
> > > >> >>
> > > >> >> 1. SPI supports for ContainerProxy and ContainerPool ( I already
> > > >> opened PR
> > > >> >> for this:
> > > >> >> https://github.com/apache/incubator-openwhisk/pull/3663)
> > > >> >> 2. SPI supports for Throttling/Entitlement logics
> > > >> >> 3. New loadbalancer
> > > >> >> 4. New ContainerPool and ContainerProxy
> > > >> >> 5. New notions for limit and logics to handle more fine-grained
> > > >> limit.(It
> > > >> >> should be able to coexist with existing limit)
> > > >> >>
> > > >> >> If there's no any comments, I will open a PR based on it one by
> > one.
> > > >> >> Then it would be better for us to discuss on this because we can
> > > >> directly
> > > >> >> discuss over basic implementation.
> > > >> >>
> > > >> >> Thanks
> > > >> >> Regards
> > > >> >> Dominic.
> > > >> >>
> > > >> >>
> > > >> >>
> > > >> >> 2018-05-09 11:42 GMT+09:00 Dominic Kim <style9595@gmail.com>:
> > > >> >>
> > > >> >> One more thing to clarify is invoker parallelism.
> > > >> >>
> > > >> >> ## Invoker coordinates all requests.
> > > >> >>
> > > >> >> I tend to disagree with the "cannot take advantage of parallel
> > > >> >> processing" bit. Everything in the invoker is parallelized after
> > > >> updating
> > > >> >> its central state (which should take a **very** short amount of
> > time
> > > >> >> relative to actual action runtime). It is not really optimized to
> > > >> scale to
> > > >> >> a lot of containers *yet*.
> > > >> >>
> > > >> >> More precisely, I think it is related to the Kafka consumer
> > > >> >> rather than the invoker. Invoker logic can run in parallel, but
> > > >> >> `MessageFeed` cannot. Once the number of outstanding messages
> > > >> >> reaches the maximum, it waits for messages to be processed. If a
> > > >> >> few activation messages for one action are not handled promptly,
> > > >> >> `MessageFeed` does not consume more messages from Kafka until
> > > >> >> some of them are processed.
> > > >> >> So subsequent messages for other actions cannot be fetched, or
> > > >> >> are delayed, due to the unprocessed messages. This is why I
> > > >> >> mentioned invoker parallelism; I should rephrase it as
> > > >> >> `MessageFeed` parallelism.
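The head-of-line blocking described above can be illustrated with a toy simulation. This is not the actual `MessageFeed` code; the cap of 4 outstanding messages and the message names are hypothetical:

```python
from collections import deque

MAX_OUTSTANDING = 4  # hypothetical cap, standing in for MessageFeed's limit

def simulate_feed(messages, completions):
    """Fetch messages in order, but stop fetching once MAX_OUTSTANDING
    messages are in flight. `completions` lists, per step, which
    in-flight messages finish processing."""
    queue = deque(messages)
    in_flight, fetched = [], []
    for done in completions:
        # release slots for completed work
        in_flight = [m for m in in_flight if m not in done]
        # fetch more only while under the outstanding cap
        while queue and len(in_flight) < MAX_OUTSTANDING:
            m = queue.popleft()
            in_flight.append(m)
            fetched.append(m)
    return fetched

# Four messages of a slow action A fill the window, so fast action B's
# messages cannot be fetched until one of A's messages completes.
msgs = ["A1", "A2", "A3", "A4", "B1", "B2"]
print(simulate_feed(msgs, [set(), {"A1"}]))
# ['A1', 'A2', 'A3', 'A4', 'B1'] -> B1 is only fetched after A1 completes
```

With multiple partitions and one consumer per partition, a slow action would only stall its own partition's window rather than the whole feed.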
> > > >> >>
> > > >> >> As you know, a partition is the unit of parallelism in Kafka. If
> > > >> >> we have multiple partitions for the activation topic, we can set
> > > >> >> up multiple consumers, which enables parallel processing of Kafka
> > > >> >> messages as well. Since the logic in the invoker can already run
> > > >> >> in parallel, with this change we can process messages entirely in
> > > >> >> parallel.
> > > >> >>
> > > >> >> In my proposal, I split activation messages from container
> > > >> >> coordination messages (action parallelism), assign more
> > > >> >> partitions for activation messages (in-topic parallelism), and
> > > >> >> enable parallel processing with multiple consumers (containers).
> > > >> >>
> > > >> >>
> > > >> >> Thanks
> > > >> >> Regards
> > > >> >> Dominic.
> > > >> >>
> > > >> >>
> > > >> >>
> > > >> >>
> > > >> >>
> > > >> >> 2018-05-08 19:34 GMT+09:00 Dominic Kim <style9595@gmail.com
> > <mailto:
> > > st
> > > >> >> yle9595@gmail.com>>:
> > > >> >>
> > > >> >> Thank you for the response Markus and Christian.
> > > >> >>
> > > >> >> Yes, I agree that we need to discuss this proposal in an
> > > >> >> abstract way instead of in conjunction with any specific
> > > >> >> technology, because then we can adopt a better software stack if
> > > >> >> possible. Let me answer your questions line by line.
> > > >> >>
> > > >> >>
> > > >> >> ## Does not wait for previous run.
> > > >> >>
> > > >> >> Yes, that is a valid point. If we keep accumulating requests in
> > > >> >> the queue, latency can become spiky, especially when the
> > > >> >> execution time of an action is large. So if we want to take this
> > > >> >> in, we need to find a proper way to balance creating more
> > > >> >> containers (for latency) against letting existing containers
> > > >> >> handle the requests.
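One possible shape for such a balance, purely illustrative and not part of the proposal: create a new container only when the expected queueing delay on the existing containers exceeds the cold-start cost. All names and numbers below are hypothetical:

```python
def should_create_container(queued: int, containers: int,
                            avg_exec_ms: float, cold_start_ms: float) -> bool:
    """Create a new container only when the expected wait on existing
    containers exceeds the cost of a cold start. A real policy would
    need smoothing, limits, and measured execution times."""
    if containers == 0:
        return True
    expected_wait_ms = (queued / containers) * avg_exec_ms
    return expected_wait_ms > cold_start_ms

# 10 queued 20 ms activations on 1 container: ~200 ms wait < 500 ms cold start
print(should_create_container(10, 1, 20.0, 500.0))   # False -> reuse
# two queued 2 s activations on 1 container: wait far exceeds a cold start
print(should_create_container(2, 1, 2000.0, 500.0))  # True -> create
```

The point of the sketch is only that the decision needs both the queue depth and the per-action execution time, which ties back to the weight-of-an-activation question raised earlier in the thread.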
> > > >> >>
> > > >> >>
> > > >> >> ## Not able to accurately control concurrent invocation.
> > > >> >>
> > > >> >> OK, I originally thought this was related to concurrent
> > > >> >> containers rather than concurrent activations.
> > > >> >> But I am still inclined toward the concurrent-containers
> > > >> >> approach.
> > > >> >> In the current logic, the limit depends on factors other than the
> > > >> >> real number of concurrent invocations.
> > > >> >> If the RTT between controllers and invokers becomes high for some
> > > >> >> reason, the controller will reject new requests even though the
> > > >> >> invokers are actually idle.
> > > >> >>
> > > >> >> ## TPS is not deterministic.
> > > >> >>
> > > >> >> I did not mean deterministic TPS for just one user; I meant
> > > >> >> system-wide deterministic TPS.
> > > >> >> Surely TPS can vary when heterogeneous actions (with different
> > > >> >> execution times) are invoked.
> > > >> >> But currently it is not easy to figure out what the TPS is even
> > > >> >> with only one kind of action, because it changes based not only
> > > >> >> on the heterogeneity of actions but also on the number of users
> > > >> >> and namespaces.
> > > >> >>
> > > >> >> I think at least we need to be able to publish this kind of
> > > >> >> official spec: if actions with a 20 ms execution time are
> > > >> >> invoked, our system sustains 20,000 TPS (no matter how many users
> > > >> >> or namespaces are used).
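As a back-of-the-envelope check (my arithmetic, not from the thread): by Little's law, sustaining a target TPS at a given per-activation execution time fixes how many concurrently busy containers the system must provide, independently of how the load is split across users or namespaces:

```python
def required_containers(tps: float, exec_time_s: float) -> float:
    """Little's law: concurrency = arrival rate x service time."""
    return tps * exec_time_s

# 20,000 TPS of 20 ms actions needs about 400 concurrently busy
# containers, regardless of the number of users or namespaces.
print(required_containers(20_000, 0.020))  # 400.0
```

This is exactly why a per-user or per-namespace limit makes the system-wide TPS hard to state: the capacity needed depends only on aggregate rate and execution time.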
> > > >> >>
> > > >> >>
> > > >> >> Your understanding about my proposal is perfectly correct.
> > > >> >> One small thing to add: the controller sends the
> > > >> >> `ContainerCreation` request based on the processing speed of the
> > > >> >> containers rather than on the availability of existing
> > > >> >> containers.
> > > >> >>
> > > >> >> BTW, regarding your concern about Kafka topics, I think we may
> > > >> >> be fine because, while the number of topics will be unbounded,
> > > >> >> the number of active topics will be bounded.
> > > >> >>
> > > >> >> If we take this approach, it is mandatory to limit the retention
> > > >> >> bytes and retention duration for each topic.
> > > >> >> Then the number of active topics is limited and the actual data
> > > >> >> in them is also limited, so I think that would be fine.
> > > >> >>
> > > >> >> But it will be necessary to find optimal configurations for
> > > >> >> retention, and to run many benchmarks to confirm this.
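For illustration, a per-action topic with such bounded retention could be created with the standard Kafka CLI. The topic name, partition count, and retention values below are placeholders, not tuned recommendations (older Kafka versions use `--zookeeper` instead of `--bootstrap-server`):

```shell
# Create a hypothetical per-action activation topic with tight retention:
# messages older than 5 minutes or beyond 10 MiB per partition are dropped.
bin/kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic activations.myNamespace.myAction \
  --partitions 4 \
  --replication-factor 1 \
  --config retention.ms=300000 \
  --config retention.bytes=10485760
```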
> > > >> >>
> > > >> >>
> > > >> >> And I didn't get the meaning of "eventual consistency" of the
> > > >> >> consumer lag.
> > > >> >> Did you mean that it is eventually consistent because it changes
> > > >> >> very quickly, even within one second?
> > > >> >>
> > > >> >>
> > > >> >> Thanks
> > > >> >> Regards
> > > >> >> Dominic
> > > >> >>
> > > >> >>
> > > >> >>
> > > >> >> 2018-05-08 17:25 GMT+09:00 Markus Thoemmes <
> > > markus.thoemmes@de.ibm.com
> > > >> <ma
> > > >> >> ilto:markus.thoemmes@de.ibm.com>>:
> > > >> >>
> > > >> >> Hey Dominic,
> > > >> >>
> > > >> >> Thank you for the very detailed writeup. Since there is a lot in
> > > here,
> > > >> >> please allow me to rephrase some of your proposals to see if I
> > > >> understood
> > > >> >> correctly. I'll go through point-by-point to try to keep it close
> > to
> > > >> your
> > > >> >> proposal.
> > > >> >>
> > > >> >> **Note:** This is a result of an extensive discussion of
> Christian
> > > >> >> Bickel (@cbickel) and myself on this proposal. I used "I"
> > throughout
> > > >> the
> > > >> >> writeup for easier readability, but all of it can be read as
> "we".
> > > >> >>
> > > >> >> # Issues:
> > > >> >>
> > > >> >> ## Interventions of actions.
> > > >> >>
> > > >> >> That's a valid concern when using today's loadbalancer. This is
> > > >> >> noisy-neighbor behavior that can happen today under the
> > circumstances
> > > >> you
> > > >> >> describe.
> > > >> >>
> > > >> >> ## Does not wait for previous run.
> > > >> >>
> > > >> >> True as well today. The algorithms used until today value
> > correctness
> > > >> >> over performance. You're right, that you could track the expected
> > > queue
> > > >> >> occupation and schedule accordingly. That does have its own risks
> > > >> though
> > > >> >> (what if your action has very spiky latency behavior?).
> > > >> >>
> > > >> >> I'd generally propose to break this out into a separate
> > > >> >> discussion. It doesn't really correlate to the other points,
> > > >> >> WDYT?
> > > >> >>
> > > >> >> ## Invoker coordinates all requests.
> > > >> >>
> > > >> >> I tend to disagree with the "cannot take advantage of parallel
> > > >> >> processing" bit. Everything in the invoker is parallelized after
> > > >> updating
> > > >> >> its central state (which should take a **very** short amount of
> > time
> > > >> >> relative to actual action runtime). It is not really optimized to
> > > >> scale to
> > > >> >> a lot of containers *yet*.
> > > >> >>
> > > >> >> ## Not able to accurately control concurrent invocation.
> > > >> >>
> > > >> >> Well, the limits are "concurrent actions in the system". You
> should
> > > be
> > > >> >> able to get 5 activations on the queue with today's mechanism.
> You
> > > >> should
> > > >> >> get as many containers as needed to handle your load. For very
> > > >> >> short-running actions, you might not need N containers to handle
> N
> > > >> >> messages
> > > >> >> in the queue.
> > > >> >>
> > > >> >> ## TPS is not deterministic.
> > > >> >>
> > > >> >> I'm wondering: has TPS been deterministic even for just one
> > > >> >> user? I'd argue that this is a valid metric in its own right. I
> > > >> >> agree that these numbers can drop significantly under
> > > >> >> heterogeneous load.
> > > >> >>
> > > >> >> # Proposal:
> > > >> >>
> > > >> >> I'll try to rephrase and add some bits of abstraction here and
> > there
> > > to
> > > >> >> see if I understood this correctly:
> > > >> >>
> > > >> >> The controller should schedule based on individual actions. It
> > should
> > > >> >> not send those to an arbitrary invoker but rather to something
> that
> > > >> >> identifies those actions themselves (a kafka topic in your
> > example).
> > > >> I'll
> > > >> >> call this *PerActionContainerPool*. Those calls from the
> controller
> > > >> will
> > > >> >> be
> > > >> >> handled by each *ContainerProxy* directly rather than being
> > threaded
> > > >> >> through another "centralized" component (the invoker). The
> > > >> >> *ContainerProxy*
> > > >> >> is responsible for handling the "aftermath": Writing activation
> > > >> records,
> > > >> >> collecting logs etc (like today).
> > > >> >>
> > > >> >> Iff the controller thinks that the existing containers cannot
> > sustain
> > > >> >> the load (i.e. if all containers are currently in use), it
> advises
> > a
> > > >> >> *ContainerCreationSystem* (all invokers combined in your case) to
> > > >> create a
> > > >> >> new container. This container will be added to the
> > > >> >> *PerActionContainerPool*.
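Markus's rephrasing of the roles might be pinned down with a sketch in code. This is entirely hypothetical naming, chosen only to mirror the three components named above, not OpenWhisk's actual (Scala) implementation:

```python
from dataclasses import dataclass, field

@dataclass
class ContainerProxy:
    """Runs activations and handles the aftermath: activation records,
    log collection, etc. (as today)."""
    container_id: str
    busy: bool = False

@dataclass
class PerActionContainerPool:
    """All containers serving one action, addressed directly by the
    controller (e.g. via a per-action Kafka topic)."""
    action: str
    proxies: list = field(default_factory=list)

    def free_proxy(self):
        # A free proxy means the pool can sustain the load as-is.
        return next((p for p in self.proxies if not p.busy), None)

class ContainerCreationSystem:
    """Stands in for the combined invokers: asked by the controller to
    create a container only when a pool has no free proxy."""
    def create(self, pool: PerActionContainerPool) -> ContainerProxy:
        proxy = ContainerProxy(f"{pool.action}-{len(pool.proxies)}")
        pool.proxies.append(proxy)
        return proxy

pool = PerActionContainerPool("myAction")
ccs = ContainerCreationSystem()
ccs.create(pool)
print(pool.free_proxy().container_id)  # myAction-0
```

The key property the sketch captures is that scheduling state lives with the pool and the controller, while the creation system holds no scheduling logic at all.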
> > > >> >>
> > > >> >> The invoker in your proposal has no scheduling logic at all
> (which
> > is
> > > >> >> sound with the issues lined out above) other than container
> > creation
> > > >> >> itself.
> > > >> >>
> > > >> >> # Conclusion:
> > > >> >>
> > > >> >> I like the proposal in the abstract way I've tried to phrase
> above.
> > > It
> > > >> >> indeed amplifies warm-container usage and in general should be
> > > >> superior to
> > > >> >> the more statistical approach of today's loadbalancer.
> > > >> >>
> > > >> >> I think we should discuss this proposal in an abstract,
> > > >> >> non-technology-bound way. I do think that having so many kafka
> > topics
> > > >> >> including all the rebalancing needed can become an issue,
> > especially
> > > >> >> because the sheer number of kafka topics is unbounded. I also
> think
> > > >> that
> > > >> >> the consumer lag is subject to eventual consistency and depending
> > on
> > > >> how
> > > >> >> eventual that is it can turn into queueing in your system, even
> > > though
> > > >> >> that
> > > >> >> wouldn't be necessary from a capacity perspective.
> > > >> >>
> > > >> >> I don't want to ditch the proposal because of those concerns
> > though!
> > > >> >>
> > > >> >> As I said: The proposal itself makes a lot of sense and I like
> it a
> > > >> lot!
> > > >> >> Let's not trap ourselves in the technology used today though.
> > You're
> > > >> >> proposing a major restructuring so we might as well think more
> > > >> >> green-fieldy. WDYT?
> > > >> >>
> > > >> >> Cheers,
> > > >> >> Christian and Markus
> > > >> >>
> > > >> >>
> > > >> >>
> > > >> >>
> > > >> >>
> > > >> >>
