mesos-dev mailing list archives

From: Zhitao Li <zhitaoli...@gmail.com>
Subject: Re: Surfacing additional issues on agent host to schedulers
Date: Wed, 21 Feb 2018 19:18:14 GMT
Hi Avinash,

We use haproxy for all outgoing traffic. For example, if an instance of
service A wants to talk to service B, it actually calls
"localhost:<some-port>", which is backed by the local haproxy instance; that
instance then forwards the request to some instance of service B.
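
For illustration, here is a minimal sketch of what such a local haproxy
config could look like (the port and backend addresses are hypothetical
placeholders, not our actual setup):

    # local haproxy on every host: service A calls 127.0.0.1:9001 and
    # haproxy forwards the request to a real service B instance
    frontend service_b_local
        bind 127.0.0.1:9001
        mode http
        default_backend service_b

    backend service_b
        mode http
        balance roundrobin
        server b1 10.0.0.11:8080 check
        server b2 10.0.0.12:8080 check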

In such a setup, if the local haproxy is not functional, almost anything
that makes outgoing requests will not run properly, and we prefer to drain
the host.

On Wed, Feb 21, 2018 at 9:45 AM, Avinash Sridharan <avinash@mesosphere.io>
wrote:

> On Tue, Feb 20, 2018 at 3:54 PM, James Peach <jorgar@gmail.com> wrote:
>
> >
> > > On Feb 20, 2018, at 11:11 AM, Zhitao Li <zhitaoli.cs@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > At a recent Mesos meetup, quite a few cluster operators complained that
> > > it is hard to model host issues with Mesos at the moment.
> > >
> > > For example, in our environment, the only signal a scheduler gets is
> > > whether the Mesos agent has disconnected from the cluster. However, we
> > > have a family of other issues in real production which make the hosts
> > > (sometimes "partially") unusable. Examples include:
> > > - traffic routing software malfunction (e.g., haproxy): the Mesos agent
> > > does not require this, so the scheduler/deployment system is not aware,
> > > but the actual workload on the cluster will fail;
> >
> Zhitao, could you elaborate on this a bit more? Do you mean the workloads
> are being load-balanced by HAProxy, and due to misconfiguration the
> workloads are now unreachable, and somehow the agent should be bubbling up
> these network issues? I am guessing that in your case HAProxy is somehow
> involved in providing connectivity to workloads on a given agent, and that
> HAProxy is actually running on that agent?
>
>
> > > - broken disk;
> > > - other long running system agent issues.
> > >
> > > This email is looking at how Mesos can recommend best practices for
> > > surfacing these issues to schedulers, and whether we need additional
> > > primitives in Mesos to achieve such a goal.
> >
> > In the K8s world the node can publish "conditions" that describe its
> > status:
> >
> >         https://kubernetes.io/docs/concepts/architecture/nodes/#condition
> >
> > The condition can automatically taint the node, which could cause pods to
> > be evicted automatically (i.e. if they can't tolerate that specific
> > taint).
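> >
> > As a rough sketch, a pod that should ride out such a taint for a bounded
> > time could declare a toleration like the following in its spec (the
> > 300-second window is an arbitrary example value):
> >
> >     tolerations:
> >     - key: "node.kubernetes.io/unreachable"
> >       operator: "Exists"
> >       effect: "NoExecute"
> >       tolerationSeconds: 300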
> >
> > J
>
>
>
>
> --
> Avinash Sridharan, Mesosphere
> +1 (323) 702 5245
>



-- 
Cheers,

Zhitao Li
