kudu-user mailing list archives

From Attila Bukor <abu...@apache.org>
Subject Re: How to decrease kudu server restart time
Date Mon, 13 Aug 2018 09:16:09 GMT
Hi Gary,

Please find my answers inline.

On Sun, Aug 12, 2018 at 01:17:03PM +0800, Gary Gao wrote:
> I have a kudu cluster of 40 nodes. When I realized that
> maintenance_manager_num_threads=1 is too small, I updated the config file and
> restarted a kudu tablet server, but it took too long to start, longer than
> --follower_unavailable_considered_failed_sec=600, causing tablet
> redistribution.
> Even after the kudu server started, it also spent too much time copying
> tablets, as shown in the following tablet copy log:
> 
> 
> Tablet 1ecbe230e14a4d9f9125dbc49c32860e of table 'impala::venus.ods_xk_pay_fee_order' is under-replicated: 1 replica(s) not RUNNING
>   41e4489d38924c85a4810bd33ef60d80 (bj-yz-hadoop01-1-12:7050): bad state
>     State:       INITIALIZED
>     Data state:  TABLET_DATA_COPYING
>     Last status: Tablet Copy: Downloading block 0000000084111077 (299837/1177225)
>   52a9ede038a04566860ecd2e54388738 (bj-yz-hadoop01-1-51:7050): RUNNING
>   b133f6fd0c274b93b21ffcbdcbbde830 (bj-yz-hadoop01-1-14:7050): RUNNING [LEADER]
> 

Which version are you using? Recent versions use 3-4-3 replica
replacement, which means the tablet copy should be canceled automatically
if the original third replica comes back online before the copy has finished.
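To picture the race described above, here is a simplified toy model of 3-4-3 replacement. It is a sketch under stated assumptions, not Kudu's implementation: the class `ToyTabletConfig`, its methods, and the replica names are all hypothetical. When a voter fails, a spare non-voter is added (4 replicas); whichever finishes first, the recovering voter or the spare's copy, brings the tablet back to 3.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyTabletConfig:
    """Hypothetical, simplified replica set of one tablet (not Kudu's actual code)."""

    voters: dict[str, bool] = field(default_factory=dict)  # replica name -> healthy?
    non_voters: set[str] = field(default_factory=set)  # replicas still being copied

    def on_voter_failed(self, failed: str, spare: str) -> None:
        # 3-4-3: keep the failed voter in the config for now and add a spare
        # as a non-voter, so the tablet temporarily has 4 replicas.
        self.voters[failed] = False
        self.non_voters.add(spare)

    def on_voter_recovered(self, name: str) -> None:
        # The original voter came back before the copy finished: all voters
        # are healthy again, so the spare is dropped and its copy canceled.
        self.voters[name] = True
        if all(self.voters.values()):
            self.non_voters.clear()

    def on_copy_finished(self, spare: str) -> None:
        # The copy finished first: evict the still-unhealthy voter(s) and
        # promote the spare, bringing the tablet back to 3 replicas.
        if spare not in self.non_voters:
            return  # the copy was already canceled
        self.non_voters.discard(spare)
        for name, healthy in list(self.voters.items()):
            if not healthy:
                del self.voters[name]
        self.voters[spare] = True


cfg = ToyTabletConfig(voters={"r1": True, "r2": True, "r3": True})
cfg.on_voter_failed("r1", spare="r4")
print(len(cfg.voters) + len(cfg.non_voters))  # 4
cfg.on_voter_recovered("r1")  # r1 is back before r4's copy completes
print(sorted(cfg.voters), cfg.non_voters)  # ['r1', 'r2', 'r3'] set()
```

The point of the extra (fourth) replica is that the failed voter is never evicted before its replacement is ready, so a quick restart does not throw away the restarted server's data.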

> 
> My questions are:
> 
> 1. It seems the kudu server spent a long time opening log block containers.
> How can I speed up restarting the kudu server?

The startup time of the tablet servers mostly depends on the number of
tablets hosted on the server. I'm not sure if there's any way to tune
it, aside from reducing the number of tablets. How many tablets do you
have per tablet server?

> 
> 2. I think the number of blocks has an influence on kudu server restart
> time and on query time for a specific tablet: the more blocks, the longer
> the restart time and query time. Is this right?

I'm not sure how much the number of blocks influences the restart time;
maybe someone else can shed some light on this one. I'd focus on the
number of tablets, though.

Query latency depends on how many blocks the server needs to read from,
but that is a matter of how well the data is compacted (either because it
was written sequentially rather than randomly, or because the maintenance
manager has compacted it), rather than of the total number of blocks.
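A small sketch of why write order matters more than the total count. It assumes a toy flush model (the helpers `flush_into_rowsets` and `rowsets_to_check` are hypothetical) in which every fixed-size batch of writes becomes one rowset covering its min..max key, and a point lookup must consult every rowset whose range could contain the key. Bloom filters and other pruning are ignored, and a deterministic interleaving stands in for random writes.

```python
from typing import List, Tuple

KeyRange = Tuple[int, int]  # inclusive (min_key, max_key) covered by one rowset


def flush_into_rowsets(keys: List[int], rows_per_flush: int) -> List[KeyRange]:
    """Toy flush model: every `rows_per_flush` writes become one rowset spanning their min..max key."""
    batches = [keys[i:i + rows_per_flush] for i in range(0, len(keys), rows_per_flush)]
    return [(min(b), max(b)) for b in batches]


def rowsets_to_check(rowsets: List[KeyRange], key: int) -> int:
    """A point lookup must consult every rowset whose key range could contain `key`."""
    return sum(1 for lo, hi in rowsets if lo <= key <= hi)


sequential = list(range(1000))
# Deterministic stand-in for random writes: each flush gets keys spread over the whole range.
scattered = [j * 10 + b for b in range(10) for j in range(100)]

print(rowsets_to_check(flush_into_rowsets(sequential, 100), 500))  # 1
print(rowsets_to_check(flush_into_rowsets(scattered, 100), 500))   # 10
```

Both layouts produce the same 10 rowsets (and so the same number of blocks), yet the scattered one forces the lookup to consult all of them; compaction helps by rewriting overlapping rowsets into non-overlapping ones.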

> 
> 3. Why are there more than 1 million blocks in a tablet, as shown in the
> tablet copy log above, while there are fewer than 500 thousand records in
> the tablet?
> 

Each rowset has multiple blocks: one per column, plus blocks for the UNDO
and REDO deltas and for the bloom filters. The number of rowsets depends on
the number of rows.
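As a back-of-the-envelope check, the block count of a tablet grows with rowsets multiplied by per-rowset blocks. The function below is a hypothetical estimate, not a Kudu API; the per-rowset delta and bloom block counts are assumptions, and any other per-rowset blocks are ignored.

```python
def estimate_blocks(num_rowsets: int, num_columns: int,
                    delta_blocks_per_rowset: int, bloom_blocks_per_rowset: int = 1) -> int:
    """Rough per-tablet block count: each rowset stores one block per column,
    plus its UNDO/REDO delta blocks and bloom filter block(s)."""
    return num_rowsets * (num_columns + delta_blocks_per_rowset + bloom_blocks_per_rowset)


# Hypothetical tablet: 1,000 rowsets, 50 columns, 2 delta blocks and 1 bloom block per rowset.
print(estimate_blocks(1_000, 50, 2))  # 53000
```

Because the column count multiplies the rowset count, a wide table can reach a large number of blocks even when the number of rows is modest.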

> 4. How can I reduce the number of blocks in a tablet?

The maintenance managers perform compactions that reduce the number of
blocks per tablet. Other than that, fewer columns or fewer rows also
result in fewer blocks, of course.
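A toy illustration of how compaction shrinks the block count. It is a sketch under stated assumptions: `merge_compact` naively merges every rowset into outputs of a hypothetical target size, whereas real compactions choose which rowsets to merge and bound their work; `blocks_for_rowsets` assumes one block per column plus a fixed number of delta/bloom blocks per rowset.

```python
import math
from typing import List


def blocks_for_rowsets(rowset_rows: List[int], num_columns: int, extra_blocks_per_rowset: int = 3) -> int:
    """Hypothetical estimate: one block per column per rowset, plus a few delta/bloom blocks."""
    return len(rowset_rows) * (num_columns + extra_blocks_per_rowset)


def merge_compact(rowset_rows: List[int], target_rows: int) -> List[int]:
    """Toy merge compaction: rewrite all rows into as few rowsets of up to `target_rows` each."""
    total = sum(rowset_rows)
    if total == 0:
        return []
    n_out = math.ceil(total / target_rows)
    base, extra = divmod(total, n_out)
    # Spread rows evenly; the first `extra` rowsets get one more row.
    return [base + 1 if i < extra else base for i in range(n_out)]


before = [500] * 1_000  # 1,000 small rowsets, 500,000 rows in total
after = merge_compact(before, 100_000)
print(blocks_for_rowsets(before, 20), blocks_for_rowsets(after, 20))  # 23000 115
```

The rows are unchanged; only the number of rowsets drops, and with it every per-rowset block.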

- Attila
