openwhisk-dev mailing list archives

From Dominic Kim <style9...@gmail.com>
Subject Re: New architecture proposal
Date Mon, 08 Apr 2019 02:09:53 GMT
Hi Mingyu

Thank you for the good questions.

Before answering your questions, I will first explain the Lease in ETCD.
ETCD has a primitive called a Lease: data attached to a lease disappears after
a given time unless a relevant keep-alive is sent for it.

So once you grant a new lease, you can attach it to data in each write
operation such as put, putTxn (transaction), etc.
If no keep-alive is sent within the given (configurable) time, the inserted
data will be gone.
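
To make the lease mechanics concrete, here is a minimal sketch using the jetcd
client. The endpoint, TTL, and key layout are my own illustrative assumptions
(not the proposal's actual code), and the builder methods may differ slightly
between jetcd versions:

import java.nio.charset.StandardCharsets.UTF_8

import io.etcd.jetcd.{ByteSequence, Client}
import io.etcd.jetcd.options.PutOption

object LeaseSketch extends App {
  // Hypothetical endpoint; a real deployment would point at its own ETCD cluster.
  val client = Client.builder().endpoints("http://127.0.0.1:2379").build()

  // Grant a lease with a 10-second TTL (the TTL would be configurable).
  val leaseId = client.getLeaseClient.grant(10L).get().getID

  // Put a key together with the lease, e.g. a scheduler endpoint (key layout is illustrative).
  val key   = ByteSequence.from("scheduler/scheduler0/endpoint", UTF_8)
  val value = ByteSequence.from("10.0.0.1:8080", UTF_8)
  client.getKVClient
    .put(key, value, PutOption.newBuilder().withLeaseId(leaseId).build())
    .get()

  // Send a single keep-alive heartbeat; if heartbeats stop (e.g. the scheduler
  // dies), ETCD removes the key once the lease TTL expires.
  client.getLeaseClient.keepAliveOnce(leaseId).get()
}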

In my proposal, most of the data in ETCD relies on a lease.
For example, each scheduler stores its endpoint information (for queue
creation) with a lease, and each queue stores its information (for activations)
in ETCD with a lease.
(Since it would be an overhead to send keep-alives from each memory queue
separately, I introduced EtcdKeepAliveService to share one global lease among
all queues in the same scheduler.)
Each ContainerProxy also stores its information in ETCD with a lease, so that
when a queue tries to create a container, it can easily count the number of
existing containers with the "Count" API.
Since both kinds of data are stored with a lease, if a scheduler or an invoker
fails, the keep-alive for the given lease is not continued, and eventually that
data will be removed.
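
As a rough illustration of that "Count" usage, a queue could count the
containers registered under an action-specific prefix with a count-only range
read. The key layout and helper name below are assumptions, not the proposal's
actual schema:

import java.nio.charset.StandardCharsets.UTF_8

import io.etcd.jetcd.{ByteSequence, Client}
import io.etcd.jetcd.options.GetOption

object ContainerCountSketch {
  // Hypothetical helper: count containers registered under an action-specific prefix.
  def countContainers(client: Client, actionPrefix: String): Long = {
    val prefix = ByteSequence.from(actionPrefix, UTF_8)
    val countOnly = GetOption.newBuilder()
      .withPrefix(prefix)   // range read over every key under the prefix
      .withCountOnly(true)  // ask ETCD for the count only, not the values
      .build()
    client.getKVClient.get(prefix, countOnly).get().getCount
  }
}

// e.g. countContainers(client, "container/guest/hello/") -- the key layout is an assumption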

Follower queues watch the leader queue information. If there is any
change (the data is removed upon scheduler failure), they receive a
notification and start a new leader election.
When a scheduler fails, the ContainerProxys that were communicating with a
queue in that scheduler receive a connection error.
They are then designed to access ETCD again to figure out the endpoint of the
leader queue.
Once one of the followers becomes the new leader, the ContainerProxys can
connect to it.
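
The follower side could look roughly like the following watch on the leader
key; the key name and the election call are placeholders for whatever the
actual implementation does:

import java.nio.charset.StandardCharsets.UTF_8
import java.util.function.Consumer

import scala.jdk.CollectionConverters._

import io.etcd.jetcd.{ByteSequence, Client, Watch}
import io.etcd.jetcd.watch.{WatchEvent, WatchResponse}

object LeaderWatchSketch {
  // Hypothetical leader key for one action's queue.
  private val leaderKey = ByteSequence.from("queue/guest/hello/leader", UTF_8)

  def watchLeader(client: Client): Watch.Watcher = {
    val onEvent: Consumer[WatchResponse] = response =>
      response.getEvents.asScala.foreach { event =>
        // The leader key vanishes once the failed scheduler's lease expires.
        if (event.getEventType == WatchEvent.EventType.DELETE)
          startLeaderElection() // placeholder: follower campaigns to become the new leader
      }
    client.getWatchClient.watch(leaderKey, Watch.listener(onEvent))
  }

  private def startLeaderElection(): Unit = ()
}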

One thing to note here is that there is only one QueueManager in each scheduler.
The QueueManager holds all queues and delegates "fetch" requests to the proper
queue, as illustrated in the sketch below.
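
Purely as an illustration of that delegation (the type names FetchRequest and
MemoryQueue below are hypothetical stand-ins, not the proposal's actual
classes):

import scala.collection.concurrent.TrieMap

// Hypothetical stand-ins for the proposal's actual types.
final case class FetchRequest(action: String)
trait MemoryQueue { def fetch(request: FetchRequest): Option[String] }

// One QueueManager per scheduler: it owns all queues and routes each
// "fetch" request to the queue responsible for the requested action.
class QueueManager {
  private val queues = TrieMap.empty[String, MemoryQueue]

  def register(action: String, queue: MemoryQueue): Unit =
    queues.put(action, queue)

  def fetch(request: FetchRequest): Option[String] =
    queues.get(request.action).flatMap(_.fetch(request))
}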

In short, all endpoint data is stored in ETCD and renewed based on keep-alives
and leases.
Each component is designed to access ETCD when a failure is detected and to
connect to the new (failed-over) scheduler.

I hope this is useful to you.
When my colleagues and I open PRs, we will add more detailed designs along
with them.

If you have any further questions, kindly let me know.

Thanks
Best regards
Dominic



On Sat, Apr 6, 2019 at 11:28 AM, Mingyu Zhou <zhoumy46@gmail.com> wrote:

> Dear Dominic,
>
> Thanks for your proposal. It is very inspirational and it looks promising.
>
> I'd like to ask some questions about the failover/failure recovery
> mechanism of the scheduler component.
>
> IIUC, a scheduler instance hosts multiple queue managers. If a scheduler is
> down, we will lose multiple queue managers. Thus, there should be some form
> of failure recovery of queue managers and it raises the following
> questions:
>
> 1. In your proposal, how is the failure of a scheduler detected? I.e.,
> when a scheduler instance is down and some queue managers become
> unreachable, which component will be aware of this unavailability and then
> trigger the recovery procedure?
>
> 2. How should the failure be recovered and lost queue managers be brought
> back to life? Specifically, in your proposal, you designed a hot-standby
> grouping of queue managers (one leader/two followers). Then how
> should the new leader be selected in the face of a scheduler crash? And do we
> need to allocate a new queue manager to maintain the
> one-leader-two-follower configuration?
>
> 3. How will the other components in the system learn the new configuration
> after a failover? For example, how will the pool balancer discover the new
> state of the schedulers it manages and change its policy to distribute
> queue creation requests?
>
> Thanks
> Mingyu Zhou
>
> On Fri, Apr 5, 2019 at 2:56 PM Dominic Kim <style9595@gmail.com> wrote:
>
> > Dear David, Matt, and Dascalita.
> > Thank you for your interest in my proposal.
> >
> > Let me answer your questions one by one.
> >
> > @David
> > Yes, I will (and actually already did) implement all components based on
> > SPIs.
> > The reason why I said "breaking changes" is that my proposal will affect
> > most components drastically.
> > For example, InvokerReactive will become an SPI, and the current
> > InvokerReactive will become one of its concrete implementations.
> > My load balancer and throttler are also based on the current SPIs.
> > So even though my implementation would be included in OpenWhisk, downstreams
> > can still take advantage of existing implementations such as the
> > ShardingPoolBalancer.
> >
> > Regarding Leader/Follower, that is a fair point.
> > The reason why I introduced such a model is to prepare for future
> > enhancements.
> > Actually, I reached the conclusion that memory-based activation passing
> > would be enough for OpenWhisk in terms of message persistence.
> > But that is just my own opinion, and the community may want a more rigid
> > level of persistence.
> > I naively thought we could add replication and HA logic in the queue,
> > similar to what Kafka does.
> > And Leader/Follower would be a good base building block for this.
> >
> > Currently only the leader fetches activation messages from Kafka. Followers
> > stay idle while watching for leadership changes.
> > Once the leadership changes, one of the followers becomes the new leader,
> > and at that time a Kafka consumer for the new leader is created.
> > This is to minimize the failure handling time from the clients' perspective,
> > as you mentioned. It is also correct that this flow does not prevent
> > activation messages from being lost on scheduler failure.
> > But it is not that complex, as activation messages are not replicated to
> > followers and the number of followers is configurable.
> > If we want, we can configure the number of required queues to 1, so there
> > will be only one leader queue.
> > (If we are OK with the current level of persistence, I think we may not need
> > more than one queue, as you said.)
> >
> > Regarding pulling activation messages, each action will have its own Kafka
> > topic.
> > This is the same as what I proposed in my previous proposals.
> > When an action is created, a Kafka topic for the action will be created.
> > So each leader queue (consumer) will fetch activation messages from its own
> > Kafka topic, and there will be no interference among actions.
> >
> > When my colleagues and I open PRs for each component, we will add detailed
> > component designs.
> > That should help you understand the proposal better.
> >
> > @Matt
> > Thank you for the suggestion.
> > If I change the name now, it would break the link in this thread.
> > I will use the name you suggested when I open a PR.
> >
> >
> > @Dascalita
> >
> > Interesting idea.
> > Do you have any GC patterns in mind that could be applied to OpenWhisk?
> >
> > In my proposal, the container GC is similar to what OpenWhisk does these
> > days.
> > Each container will autonomously fetch activations from the queue.
> > Whenever it finishes invoking one activation, it will fetch the next request
> > and invoke it.
> > In this sense, we can maximize container reuse.
> >
> > When there are no more activation messages, a ContainerProxy will wait for
> > a given (configurable) time and then just stop.
> > One difference is that containers are no longer paused; they are just removed.
> > Instead of being paused, containers wait for subsequent requests for a
> > longer time (5~10s) than in the current implementation.
> > This is because pausing is also a relatively expensive operation in
> > comparison to a short-running invocation.
> >
> > The container lifecycle is managed in this way:
> > 1. When a container is created, it adds its information to ETCD.
> > 2. A queue counts the existing number of containers using the above
> > information.
> > 3. Under heavy load, the queue requests more containers if the number of
> > existing containers is less than its resource limit.
> > 4. When a container is removed, it deletes its information from ETCD.
> >
> >
> > Again, I really appreciate all your feedback and questions.
> > If you have any further questions, kindly let me know.
> >
> > Best regards
> > Dominic
> >
> >
> >
> > On Fri, Apr 5, 2019 at 1:24 AM, Dascalita Dragos <ddragosd@gmail.com> wrote:
> >
> > > Hi Dominic,
> > > Thanks for sharing your ideas. IIUC (and pls keep me honest), the goal of
> > > the new design is to improve activation performance. I personally love
> > > this; performance is a critical non-functional feature of any FaaS system.
> > >
> > > There’s something I’d like to call out: the management of containers in a
> > > FaaS system could be compared to a JVM. A JVM allocates objects in memory
> > > and GCs them. A FaaS system allocates containers to run actions, and it GCs
> > > them when they become idle. If we could look at OW's scheduling from this
> > > perspective, we could reuse the proven patterns in the JVM vs inventing
> > > something new. I’d be interested in any GC implications in the new design,
> > > meaning how idle actions get removed, and how that is orchestrated.
> > >
> > > Thanks,
> > > dragos
> > >
> > >
> > > On Thu, Apr 4, 2019 at 8:40 AM Matt Sicker <boards@gmail.com> wrote:
> > >
> > > > Would it make sense to define an OpenWhisk Improvement/Enhancement
> > > > Proposal or similar, like various other Apache projects do? We could
> > > > call them WHIPs or something. :)
> > > >
> > > > On Thu, 4 Apr 2019 at 09:09, David P Grove <groved@us.ibm.com> wrote:
> > > > >
> > > > >
> > > > > Dominic Kim <style9595@gmail.com> wrote on 04/04/2019 04:37:19 AM:
> > > > > >
> > > > > > I have proposed a new architecture.
> > > > > >
> > > > > > https://cwiki.apache.org/confluence/display/OPENWHISK/New+architecture+proposal
> > > > > >
> > > > > > It includes many controversial agendas and breaking changes.
> > > > > > So I would like to form a general consensus on them.
> > > > > >
> > > > >
> > > > > Hi Dominic,
> > > > >
> > > > >         There's much to like about the proposal.  Thank you for writing
> > > > > it up.
> > > > >
> > > > >         One meta-comment is that the work will have to be done in a way
> > > > > so there are no actual "breaking changes".  It has to be possible to
> > > > > continue to configure the system using the existing architectures while
> > > > > this work proceeds.  I would expect this could be done via a new
> > > > > LoadBalancer and some deployment options (similar to how Lean OpenWhisk
> > > > > was done).  If work needs to be done to generalize the LoadBalancer SPI,
> > > > > that could be done early in the process.
> > > > >
> > > > >         On the proposal itself, I wonder if the complexity of
> > > > > Leader/Follower is actually needed?  If a Scheduler crashes, it could be
> > > > > restarted and then resume handling its assigned load.  I think there
> > > > > should be enough information in etcd for it to recover its current set
> > > > > of assigned ContainerProxys and carry on.   Activations in its in memory
> > > > > queues would be lost (bigger blast radius than the current architecture),
> > > > > but I don't see that the Leader/Follower changes that (seems way too
> > > > > expensive to be replicating every activation in the Follower Queues).
> > > > > The Leader/Follower would allow for shorter downtime for those actions
> > > > > assigned to the downed Scheduler, but at the cost of significant
> > > > > complexity.  Is it worth it?
> > > > >
> > > > >         Perhaps related to the Leader/Follower, it's not clear to me how
> > > > > activation messages are being pulled from the action topic in Kafka
> > > > > during the Queue creation window.  I think they have to go somewhere
> > > > > (because there is a mix of actions on a single Kafka topic and we can't
> > > > > stall other actions while waiting for a Queue to be created for a new
> > > > > action), but if you don't know yet which Scheduler is going to win the
> > > > > race to be a Leader, how do you know where to put them?
> > > > >
> > > > > --dave
> > > >
> > > >
> > > >
> > > > --
> > > > Matt Sicker <boards@gmail.com>
> > > >
> > >
> >
>
>
> --
> 周明宇
>
