kudu-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Adar Lieber-Dembo <a...@cloudera.com>
Subject Re: Underutilization of hardware resources with smaller number of Tservers
Date Fri, 12 Jul 2019 17:54:13 GMT
Thanks for the detailed summary and analysis. I want to make sure I
understand the overall cluster topology. You have two physical hosts,
each running one VM, with each VM running either 4 or 8 tservers. Is
that correct?

Here are some other questions:
- Have you verified that your _client_ machines (i.e. the ETL job
machine or the machine running loadgen) aren't fully saturated? If it
is, that would explain why the server machines aren't fully utilized.
You mention adding more and more tservers until the hardware is
saturated so I imagine the client machines aren't the bottleneck, but
it's good to check.
- How is the memory utilization? Are the VMs overcommitted? Is there
any swapping? I noticed you haven't configured a memory limit for the
tservers, which means each will try to utilize 80% of the available
RAM on the machine. That's significant overcommitment when there are
multiple tservers in each VM.
- The tserver configuration shows three data directories per NVMe. Why
not just one data directory per NVMe?
- The schema is helpful, but it's not clear how many hash buckets are
in use. I imagine you vary that based on the number of tservers, but
it'd be good to know the details here.

On an unrelated note, you mentioned using one master but the
configurations show a cluster with two. That's probably not related to
the performance issue, but you should know that a 2-master
configuration can't tolerate any faults; you should use 3 masters if
your goal is tolerate the loss of a master.

On Fri, Jul 12, 2019 at 9:19 AM Dmitry Degrave <dmeetry@gmail.com> wrote:
> Hi All,
> We have a small 1.9.0 Kudu cluster for pre-production testing with
> rather powerful nodes. Each node has:
> 28 cores with HT (total 56 logical cores)
> Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
> RAM: 503GiB
> 10 NVME drives
> We run OpenStack VMs on top of them, 1 VM per physical host. Each VM
> has 50 cores, 500GiB RAM and pci passthrough access to 4 NVMe drives.
> It is expected to see the same performance with a single tserver per
> VM comparing with multiple tservers per VM (giving the total number of
> MM threads per VM is equal) - based on discussion we had on
> getkudu/kudu-general channel.
> This is not what we actually have seen during the last months. Based
> on our experience, more tservers utilize hardware resources better and
> give better overall ETL performance than less tservers.
> We are trying to understand the reasons.
> To demonstrate the difference, below are results and configuration for
> running cluster on 2 VMs, each on separate physical host, with 4 and 8
> tservers per VM and 17 maintenance manager threads per VM total. We
> run a single master per VM in both cases with identical config.
> We see the same pattern with more VMs in cluster and with different
> number of tservers per VM (keeping total number of maintenance manager
> threads per VM equal) - the more tservers per VM, the better
> performance (till certain limit when we reach h/w capacity).
> Starting 4 tservers per VM:
> https://gist.github.com/dnafault/96bcac24974ea4e50384ecd22161510d
> Starting 8 tservers per VM:
> https://gist.github.com/dnafault/a16ae893b881b44d95378b5e4a056862
> Starting 1 master per VM:
> https://gist.github.com/dnafault/87ec3565e36b0a88568f8b86f37f9521
> In both cases we start ETL from Spark node on separate physical host
> with identical hardware resources as above. ETL is finished in 910 min
> with 4 tservers and in 830 min with 8 tservers.
> Looking at how hardware is utilized, we can see that all resources are
> utilized less with 4 tservers:
> - CPUs (50 cores on VM):
>     4 tservers load avg:
> https://user-images.githubusercontent.com/20377386/60939960-21ebee80-a31d-11e9-911a-1e5cc2abcb26.png
>     8 tservers load avg:
> https://user-images.githubusercontent.com/20377386/60940404-b3a82b80-a31e-11e9-8ce7-2f843259bf91.png
> - Network:
>     4 tservers, kb:
> https://user-images.githubusercontent.com/20377386/60940071-96bf2880-a31d-11e9-8aa6-6decaa3f651a.png
>     8 tservers, kb:
> https://user-images.githubusercontent.com/20377386/60940503-0bdf2d80-a31f-11e9-85de-f5e4b0959c1b.png
>     4 tservers, pkts:
> https://user-images.githubusercontent.com/20377386/60940063-92930b00-a31d-11e9-81cf-b7e9f05e53ec.png
>     8 tservers, pkts:
> https://user-images.githubusercontent.com/20377386/60940441-d0dcfa00-a31e-11e9-8c03-330024339a34.png
> - NVMe drives:
>     4 tservers:
> https://user-images.githubusercontent.com/20377386/60940249-2238b980-a31e-11e9-8fff-2233ce18f74d.png
>     8 tservers:
> https://user-images.githubusercontent.com/20377386/60940534-2e714680-a31f-11e9-956e-76c175de6bfa.png
> Schema in our tests:
> https://gist.github.com/dnafault/e55ea987c55d2960c738d94e4811d043
> Variant table has 9*10^7 rows, GT table has 1.4*10^10 rows.
> Manuel reproduced the same pattern with synthetic tests on the same
> cluster (all configs use 17MM threads total per VM). Test load is
> generated from external VM (no kudu node on it) with:
> seq 20 | xargs -L1 -I{} -P4 time ./kudu perf loadgen kudu-1,kudu-3
> --num_rows_per_thread=50000000 --num_threads=6
> Time to finish tests:
>     4 tservers: 18 min 11 sec
>     8 tservers: 16 min  4 sec
>    16 tservers: 15 min 23 sec
> We want to understand what are the factors which do not allow a
> smaller # tservers to give a similar performance as the larger #
> tservers. Is there anything we miss in configuration ?
> thanks,
> dmitry

View raw message