kudu-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dmitry Degrave <dmee...@gmail.com>
Subject Underutilization of hardware resources with smaller number of Tservers
Date Fri, 12 Jul 2019 16:18:45 GMT
Hi All,

We have a small 1.9.0 Kudu cluster for pre-production testing with
rather powerful nodes. Each node has:

28 cores with HT (total 56 logical cores)
Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz

RAM: 503GiB
10 NVME drives

We run OpenStack VMs on top of them, 1 VM per physical host. Each VM
has 50 cores, 500GiB RAM and pci passthrough access to 4 NVMe drives.

Based on a discussion we had on the getkudu/kudu-general channel, we
expected to see the same performance with a single tserver per VM as
with multiple tservers per VM (given that the total number of
maintenance manager threads per VM is equal).

This is not what we have actually seen over the last few months. In
our experience, more tservers utilize the hardware better and give
better overall ETL performance than fewer tservers.

We are trying to understand the reasons.

To demonstrate the difference, below are the results and configuration
for running the cluster on 2 VMs, each on a separate physical host,
with 4 and 8 tservers per VM and 17 maintenance manager threads per VM
in total. We run a single master per VM in both cases, with an
identical config.

We see the same pattern with more VMs in the cluster and with
different numbers of tservers per VM (keeping the total number of
maintenance manager threads per VM equal): the more tservers per VM,
the better the performance (up to a certain limit, where we hit the
hardware capacity).

Starting 4 tservers per VM:
https://gist.github.com/dnafault/96bcac24974ea4e50384ecd22161510d

Starting 8 tservers per VM:
https://gist.github.com/dnafault/a16ae893b881b44d95378b5e4a056862

Starting 1 master per VM:
https://gist.github.com/dnafault/87ec3565e36b0a88568f8b86f37f9521
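
To give an idea of the general shape of each tserver invocation (the
exact commands and directories we use are in the gists above; the
paths, ports and thread count below are only illustrative), instance i
of N tservers on a VM is started roughly like this, with its own
WAL/data directories and ports and a share of the 17 maintenance
manager threads:

    # Illustrative sketch only -- see the gists above for the real commands.
    # Ports and the per-tserver maintenance thread split are examples, not
    # our exact values.
    i=1
    kudu-tserver \
      --fs_wal_dir=/data/nvme${i}/kudu/tserver-${i}/wal \
      --fs_data_dirs=/data/nvme${i}/kudu/tserver-${i}/data \
      --rpc_bind_addresses=0.0.0.0:$((7150 + i)) \
      --webserver_port=$((8150 + i)) \
      --tserver_master_addrs=kudu-1:7051,kudu-3:7051 \
      --maintenance_manager_num_threads=4 &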

In both cases we start the ETL from a Spark node on a separate
physical host with the same hardware as above. The ETL finishes in
910 minutes with 4 tservers and in 830 minutes with 8 tservers.
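
The ETL itself is a Spark job writing into Kudu; its submission is
roughly of the following shape (assuming the kudu-spark2 integration
for Kudu 1.9.0; the class name, jar and arguments below are
placeholders, not our actual job):

    # Rough shape of the submission only; class, jar and arguments are placeholders.
    spark-submit \
      --packages org.apache.kudu:kudu-spark2_2.11:1.9.0 \
      --class com.example.Etl \
      etl.jar \
      kudu-1:7051,kudu-3:7051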

Looking at how the hardware is utilized, we can see that all resources
are utilized less with 4 tservers:

- CPUs (50 cores per VM):
    4 tservers load avg:
https://user-images.githubusercontent.com/20377386/60939960-21ebee80-a31d-11e9-911a-1e5cc2abcb26.png
    8 tservers load avg:
https://user-images.githubusercontent.com/20377386/60940404-b3a82b80-a31e-11e9-8ce7-2f843259bf91.png

- Network:
    4 tservers, kb:
https://user-images.githubusercontent.com/20377386/60940071-96bf2880-a31d-11e9-8aa6-6decaa3f651a.png
    8 tservers, kb:
https://user-images.githubusercontent.com/20377386/60940503-0bdf2d80-a31f-11e9-85de-f5e4b0959c1b.png

    4 tservers, pkts:
https://user-images.githubusercontent.com/20377386/60940063-92930b00-a31d-11e9-81cf-b7e9f05e53ec.png
    8 tservers, pkts:
https://user-images.githubusercontent.com/20377386/60940441-d0dcfa00-a31e-11e9-8c03-330024339a34.png

- NVMe drives:
    4 tservers:
https://user-images.githubusercontent.com/20377386/60940249-2238b980-a31e-11e9-8fff-2233ce18f74d.png
    8 tservers:
https://user-images.githubusercontent.com/20377386/60940534-2e714680-a31f-11e9-956e-76c175de6bfa.png

Schema in our tests:
https://gist.github.com/dnafault/e55ea987c55d2960c738d94e4811d043

The variant table has 9*10^7 rows, and the GT table has 1.4*10^10 rows.

Manuel reproduced the same pattern with synthetic tests on the same
cluster (all configs use 17 MM threads in total per VM). The test load
is generated from an external VM (with no Kudu node on it) with:

seq 20 | xargs -L1 -I{} -P4 time ./kudu perf loadgen kudu-1,kudu-3
--num_rows_per_thread=50000000 --num_threads=6
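
That is, 20 loadgen invocations with at most 4 running concurrently,
each writing 6 * 50M = 300M rows, i.e. 6*10^9 rows per test in total.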

Time to finish tests:

    4 tservers: 18 min 11 sec
    8 tservers: 16 min  4 sec
   16 tservers: 15 min 23 sec

We want to understand the factors that prevent a smaller number of
tservers from giving performance similar to a larger number of
tservers. Is there anything we are missing in the configuration?

thanks,
dmitry
