From Adar Lieber-Dembo <a...@cloudera.com>
Subject Re: How to decrease kudu server restart time
Date Mon, 13 Aug 2018 21:40:52 GMT

> Even after the Kudu server started, it spent too much time copying tablets, as the
> following tablet copy log shows:
> Tablet 1ecbe230e14a4d9f9125dbc49c32860e of table 'impala::venus.ods_xk_pay_fee_order' is under-replicated: 1 replica(s) not RUNNING
>   41e4489d38924c85a4810bd33ef60d80 (bj-yz-hadoop01-1-12:7050): bad state
>     State:       INITIALIZED
>     Data state:  TABLET_DATA_COPYING
>     Last status: Tablet Copy: Downloading block 0000000084111077 (299837/1177225)
>   52a9ede038a04566860ecd2e54388738 (bj-yz-hadoop01-1-51:7050): RUNNING
>   b133f6fd0c274b93b21ffcbdcbbde830 (bj-yz-hadoop01-1-14:7050): RUNNING [LEADER]

I see that this tablet has over a million blocks, but how are you
measuring that it's spending too much time copying? How much time did
it take to fully copy this tablet?
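
One rough way to measure it is to sample the "Last status" line twice and
extrapolate from the (done/total) block counter. The Python sketch below
assumes the status text keeps the format shown in your output. The function
names are made up for illustration and are not part of Kudu, and the linear
estimate ignores the fact that blocks vary in size.

```python
import re

# Matches the counter in a status line such as
# "Tablet Copy: Downloading block 0000000084111077 (299837/1177225)".
PROGRESS_RE = re.compile(r"Downloading block \S+ \((\d+)/(\d+)\)")


def copy_progress(status):
    """Return (blocks_done, blocks_total), or None if there is no counter."""
    match = PROGRESS_RE.search(status)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def estimate_remaining_seconds(first, second, elapsed_seconds):
    """Linear estimate from two (done, total) samples.

    elapsed_seconds is the (positive) time between the two samples.
    Returns None if no progress was made between them.
    """
    done_first, _ = first
    done_second, total = second
    rate = (done_second - done_first) / elapsed_seconds  # blocks per second
    if rate <= 0:
        return None
    return (total - done_second) / rate
```

Taking the two samples a minute or more apart should smooth out short stalls.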

> 1. It seems the Kudu server spends a long time opening log block containers. How can we
> speed up restarting the Kudu server?

Your Kudu server log should contain some log messages that'll help us
understand what's going on. Look for a message like "Time spent
opening block manager" and paste that.  Also can you find and paste
the "FS layout report"?
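
If the log is large, a small script can pull those lines out. This is only a
sketch: it matches the two phrases quoted above as plain substrings, the log
path is whatever your tserver is configured to use, and the number of
following lines to print after each hit ("context") is a guess on the
assumption that the report continues on the lines after its header.

```python
import sys

# The two phrases mentioned above, matched as plain substrings.
MARKERS = ("Time spent opening block manager", "FS layout report")


def find_startup_lines(lines, context=0):
    """Return (line_number, text) for each matching line and the
    `context` lines that follow it."""
    hits = []
    remaining = 0
    for number, line in enumerate(lines, start=1):
        if any(marker in line for marker in MARKERS):
            remaining = context + 1
        if remaining > 0:
            hits.append((number, line.rstrip("\n")))
            remaining -= 1
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_startup_lines.py /path/to/tserver/log
    with open(sys.argv[1], errors="replace") as log:
        for number, text in find_startup_lines(log, context=15):
            print(f"{number}: {text}")
```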

In general, the more blocks (and thus block containers) you have, the
longer it'll take Kudu to restart. KUDU-2014 has some ideas on how we
might improve this.

Once a tserver is deemed dead and its data is rereplicated elsewhere,
you can just reformat the node (i.e. delete the contents of the WAL,
metadata, and data directories). Its contents are no longer necessary,
and this will reset the number of log block containers to 0, which
will speed up subsequent restarts.

> 2. I think the number of blocks influences both Kudu server restart time and query time
> on a specific tablet: the more blocks, the longer the restart and query times.
> Is this right?

Yes to restarting time, but not necessarily to query time. It really
depends on the kinds of queries you're issuing, how many predicates
they have, etc.

> 3. Why are there more than 1 million blocks in the tablet, as shown in the Tablet Copy
> log above, when there are fewer than 500 thousand records in the tablet?

That's an excellent question. What kind of write workload do you have?
What's your table schema and partitioning? Do you have any
non-standard flags defined that may affect how Kudu flushes or
compacts its data?

I'd also suggest running the CLI tool 'kudu local_replica data_size'
on that large replica you described above. It will help identify
whether this is a case of very large tablets, or just high numbers of


> 4. How can we reduce the number of blocks in a tablet?

Once you answer the questions I posed just above, I might be able to
offer some recommendations for how to reduce the overall number of blocks.

