kudu-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Feature request for Kudu 1.3.0
Date Sat, 11 Feb 2017 00:38:55 GMT
Inline below as well.

On Fri, Feb 10, 2017 at 1:23 PM, Weber, Richard <riweber@akamai.com> wrote:

>
>
>
> On Fri, Feb 10, 2017 at 10:32 AM, Weber, Richard <riweber@akamai.com>
> wrote:
>
> I definitely would push for prioritization on this.
>
>
>
> Our main use case is less about multiple racks and failure, and more about
> functionality during the install process.  Our clusters are installed in
> logical regions, and we install 1/3 of a region at a time.  That means 1/3
> of the cluster can be down for the SW install, reboot, or something else.
> Allowing rack locality to be logically defined would let the data remain
> available during normal maintenance operations.
>
>
>
> That's an interesting use case. How long is that 1/3 of the cluster
> typically down for? I'd be afraid that, if it's down for more than a couple
> of minutes, there's a decent chance of losing one server in the other 2/3
> region, which would leave a tablet at 1/3 replication and unavailable for
> writes or consistent reads. Is that acceptable for your target use cases?
>
>
>
> Nodes would typically be down for 5-15 minutes or so.  Are you saying that
> if 1 node goes down, there's an increased chance of one of the other 2
> going down as well?
>

Not that it increases the chances of the other two going down, but it does
increase the impact: with one replica already out, losing a second would drop
a tablet below its majority (2 of 3) and leave it unavailable.


> That doesn't sound good if losing a node increases the instability of the
> system.  Additionally, wouldn't the tablets start re-replicating the data
> if the other 2 of the 3 replicas detect that the node has been down for too
> long?
>
>

Yep - the default setting is 5 minutes iirc.
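
For reference, a rough sketch (Python used just as notation) of the tablet
server flags I believe control that window; the names and defaults are from
memory, so double-check them against the flag reference for your version:

# Assumed flag names/defaults -- verify before relying on these.
ASSUMED_FAILURE_DETECTION_FLAGS = {
    # How long a follower can be unreachable before the leader considers it
    # failed and evicts it from the tablet's Raft config.
    "follower_unavailable_considered_failed_sec": 300,  # 5 minutes
    # Interval at which Raft leaders heartbeat their followers.
    "raft_heartbeat_interval_ms": 500,
}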


>
>
> How does the system typically handle a node failing?  Is re-replication of
> data not automatic?  (I haven't experimented with this enough)
>
>
>

Right - after 5 minutes, the leader replica of a tablet will decide that the
node is dead and evict its replica. The master will then notice that the
tablet is under-replicated and create a new replica.

There's a design we're working on where, instead of immediately evicting the
presumed-dead replica, the system would first recruit a new 4th replica and
get it online. That way, if the "dead" one comes back, it can rejoin
transparently without having to wait for the full new copy to be made.
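
To make the difference concrete, here's a toy sketch (plain Python, not Kudu
code) of the two recovery orders for a tablet that normally has 3 replicas;
the server names are made up:

# Toy model contrasting the two recovery orders described above.

def evict_then_replace(replicas, dead, spare):
    # Current behavior: 3 -> 2 -> 3. The tablet runs on only 2 replicas
    # while the full copy to the spare server is made.
    replicas.remove(dead)       # leader evicts the presumed-dead replica
    replicas.append(spare)      # master then schedules a replacement copy
    return replicas

def replace_then_evict(replicas, dead, spare, dead_came_back):
    # Proposed behavior: 3 -> 4 -> 3. A 4th replica is recruited first, and
    # the "dead" one is only evicted if it never comes back.
    replicas.append(spare)      # start the new copy before evicting anything
    if dead_came_back:
        replicas.remove(spare)  # original replica rejoined; drop the extra
    else:
        replicas.remove(dead)   # otherwise evict once the replacement exists
    return replicas

print(evict_then_replace(["ts1", "ts2", "ts3"], "ts3", "ts4"))
print(replace_then_evict(["ts1", "ts2", "ts3"], "ts3", "ts4", dead_came_back=True))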


> Our install process is along the lines of:
>
> 1) copy software to target machine
> 2) shut down services on machine
> 3) expand software to final location
> 4) reboot (if new kernel)
> 5) restart services.
>

OK, hopefully that usually goes quickly. I've seen other orgs try this and
hit issues where the reboot ends up running fsck on 12x4TB drives and the
restart takes an hour, though :)
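
If it helps, one way to avoid stacking the install on top of an unnoticed
failure is to gate each batch on cluster health before starting the next
one. A minimal sketch, assuming the "kudu cluster ksck" tool and made-up
master addresses (adjust both for your deployment):

import subprocess
import sys
import time

MASTERS = "master1:7051,master2:7051,master3:7051"  # hypothetical addresses

def cluster_healthy():
    # ksck exits non-zero when it finds unhealthy tables or tablets.
    return subprocess.call(["kudu", "cluster", "ksck", MASTERS]) == 0

def wait_until_healthy(timeout_s=1800, poll_s=30):
    # Poll until the cluster reports healthy or we give up.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if cluster_healthy():
            return True
        time.sleep(poll_s)
    return False

if not wait_until_healthy():
    sys.exit("cluster still unhealthy; holding off on the next batch")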

-Todd

-- 
Todd Lipcon
Software Engineer, Cloudera
