incubator-blur-user mailing list archives

From Aaron McCurry <amccu...@gmail.com>
Subject Re: Shard takeover behavior
Date Sat, 08 Mar 2014 15:26:43 GMT
On Fri, Mar 7, 2014 at 5:42 AM, Ravikumar Govindarajan <
ravikumar.govindarajan@gmail.com> wrote:

> >
> > Well it works that way for OOMs and for when the process drops hard (think
> > kill -9).  However when a shard server is shut down it currently ends its
> > session in ZooKeeper, thus triggering a layout change.
>
>
> Yes, maybe we can have a config to determine whether it should end or
> maintain the session in ZK when doing a normal shutdown and subsequent
> restart.  That way, both MTTR-conscious and layout-conscious settings can
> be supported.
>

That's a neat idea.  Once we have shards being served on multiple servers
we should definitely take a look at this.  When we implement the
multi-shard serving I would guess that there will be 2 layout strategies
(they might be implemented together).

1. Would be to get the N replicas online on different servers.
2. Would be to assign the writing leader for the shard, assuming that one is needed.

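To make the config idea above concrete, here is a minimal sketch of what a
config-driven shutdown could look like using the plain ZooKeeper client API.
The flag and method names are invented for illustration and are not existing
Blur code:

import org.apache.zookeeper.ZooKeeper;

public class ShardShutdownSketch {

  // Hypothetical flag: a "layout-conscious" shutdown ends the ZK session right
  // away (immediate layout change), while an "MTTR-conscious" restart leaves
  // the session open so the ephemeral registration survives until
  // zk.session.timeout and a quick restart keeps the existing layout.
  public static void shutdown(ZooKeeper zk, boolean endZkSession) throws InterruptedException {
    if (endZkSession) {
      zk.close(); // ephemeral nodes are removed immediately -> controllers re-layout
    }
    // else: exit without closing the session; if the process restarts before
    // the session timeout expires, no dead-node notification is ever raised
  }
}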

>
> How do you think we can detect that a particular shard-server is
> struggling/shut-down and hence incoming search-requests need to go to some
> other server?
>
> I am listing a few paths off the top of my head:
>
> 1. Process baby-sitters like supervisord, alerting controllers
> 2. Tracking the first network-exception in the controller and diverting to a
>    read-only instance, with periodic retries
> 3. Taking a statistics-based decision, based on previous response times etc.
>

Adding to this one, and this may be obvious, but measure the response time
in comparison with other shards.  Meaning if the entire cluster is
experiencing an increase in load and all response times are increasing, we
wouldn't want to start killing off shard servers inadvertently.  Look for
outliers.

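A rough sketch of that outlier check, comparing each server's recent average
latency against the cluster-wide median rather than a fixed limit; the
thresholds and names here are made up for illustration:

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LatencyOutliersSketch {

  // Flag only servers whose latency is well above the cluster-wide median, so
  // a load spike that slows every server does not get anyone kicked out.
  public static Set<String> findOutliers(Map<String, Double> avgLatencyMs) {
    Set<String> outliers = new HashSet<String>();
    if (avgLatencyMs.isEmpty()) {
      return outliers;
    }
    List<Double> latencies = new ArrayList<Double>(avgLatencyMs.values());
    Collections.sort(latencies);
    double median = latencies.get(latencies.size() / 2);
    for (Map.Entry<String, Double> e : avgLatencyMs.entrySet()) {
      // Illustrative thresholds: 3x the median and at least 500 ms absolute gap.
      if (e.getValue() > 3 * median && e.getValue() - median > 500) {
        outliers.add(e.getKey());
      }
    }
    return outliers;
  }
}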

> 4. Build some kind of leasing mechanism in ZK etc...
>

I think that all of these are good approaches.  Likely, to determine that a
node is misbehaving and should be killed/not used anymore, we would want
multiple ways to measure that condition and then vote on the need to kick it
out.

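Something like the following could combine those signals into a single
decision; the Detector interface and the simple majority rule are just one
possible shape (all names invented):

import java.util.List;

public class KickOutVoteSketch {

  // One vote per signal: supervisord/baby-sitter status, network exceptions
  // seen by controllers, latency outliers, a ZK lease, and so on.
  public interface Detector {
    boolean suspectsDown(String shardServer);
  }

  // Only act when a majority of independent detectors agree the node is bad.
  public static boolean shouldKickOut(String shardServer, List<Detector> detectors) {
    int votes = 0;
    for (Detector detector : detectors) {
      if (detector.suspectsDown(shardServer)) {
        votes++;
      }
    }
    return votes * 2 > detectors.size();
  }
}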

Aaron

>
> --
> Ravi
>
>
> On Fri, Mar 7, 2014 at 8:01 AM, Aaron McCurry <amccurry@gmail.com> wrote:
>
> > On Thu, Mar 6, 2014 at 6:30 AM, Ravikumar Govindarajan <
> > ravikumar.govindarajan@gmail.com> wrote:
> >
> > > I came to know about the zk.session.timeout variable just now, while
> > > reading more about this problem.
> > >
> > > This will only trigger a dead-node notification after the configured
> > > timeout expires.  Setting it to 3-4 mins should be fine for OOMs and
> > > rolling restarts.
> > >
> >
> > Well it works that way for OOMs and for when the process drops hard (think
> > kill -9).  However when a shard server is shut down it currently ends its
> > session in ZooKeeper, thus triggering a layout change.
> >
> >
> > >
> > > The only extra thing I am looking for is to divert search calls to a
> > > read-only shard instance during this 3-4 minute window to avoid
> > > mini-outages.
> > >
> >
> > Yes, and I think that the controllers will automatically spread the
> > queries across those servers that are online.  The BlurClient class
> > already takes a list of connection strings and treats all connections as
> > equals.  For example, its current use is to provide the client with all
> > the controllers' connection strings.  Internally, if any one of the
> > controllers goes down or has a network issue, another controller is
> > automatically retried without the user having to do anything.  There is
> > back off, ping, and pooling logic in the BlurClientManager that the
> > BlurClient utilizes.
> >
> > Aaron
> >
> >
> > >
> > > --
> > > Ravi
> > >
> > >
> > >
> > > On Thu, Mar 6, 2014 at 3:34 PM, Ravikumar Govindarajan <
> > > ravikumar.govindarajan@gmail.com> wrote:
> > >
> > > > What do you think of giving some extra leeway for shard-server
> > > > failover cases?
> > > >
> > > > Ex: Whenever a shard-server process gets killed, the controller-node
> > > > does not immediately update the layout, but rather marks it as a
> > > > suspect.
> > > >
> > > > When we have a read-only back-up of the shard, searches can continue
> > > > unhindered.  Indexing during this time can be diverted to a queue,
> > > > which will store and retry ops when the shard-server comes online
> > > > again.
> > > >
> > > > Over a configured number of attempts/time, if the shard-server does
> > > > not come up, then one controller-server can authoritatively mark it as
> > > > down and update the layout.
> > > >
> > > > --
> > > > Ravi
> > > >
> > > >
> > >
> >
>
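
For reference, the BlurClient usage described in the quoted reply above looks
roughly like the sketch below.  The host names and port are placeholders, and
the exact package and method names are from memory, so treat them as
illustrative; failover, back off, ping, and pooling happen inside
BlurClientManager:

import org.apache.blur.thrift.BlurClient;
import org.apache.blur.thrift.generated.Blur.Iface;

public class ClientSketch {
  public static void main(String[] args) throws Exception {
    // All controller connection strings are passed up front and treated as
    // equals; if one controller goes down or has a network issue, another is
    // retried transparently without the caller doing anything.
    Iface client = BlurClient.getClient("controller1:40010,controller2:40010");
    System.out.println(client.tableList());
  }
}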

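And for the older proposal in the deepest quote (marking a killed shard
server as a suspect and diverting indexing to a queue), a minimal sketch
under invented names; the mutation type and write path are placeholders,
not Blur APIs:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SuspectShardWriteBufferSketch {

  // Placeholder for whatever mutation type the indexing path actually uses.
  public static class Mutation { }

  // Stand-in for the real write path on the recovered shard server.
  public interface ShardWriter {
    void mutate(Mutation mutation) throws Exception;
  }

  // While a shard server is only a "suspect", buffer mutations instead of
  // failing them; searches keep going against the read-only copy.
  private final BlockingQueue<Mutation> pending = new LinkedBlockingQueue<Mutation>();

  public void divert(Mutation mutation) {
    pending.add(mutation);
  }

  // Replay the buffered ops once the shard server re-registers; if it never
  // does within the configured window, a controller marks it down instead.
  public void replay(ShardWriter writer) throws Exception {
    Mutation m;
    while ((m = pending.poll()) != null) {
      writer.mutate(m);
    }
  }
}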