kudu-user mailing list archives

From Manuel Sopena Ballesteros <manuel...@garvan.org.au>
Subject RE: Underutilization of hardware resources with smaller number of Tservers
Date Mon, 15 Jul 2019 07:16:33 GMT
VMs are not overcommitted --> multiple vCPUs do not share the same physical CPU.
I put multiple data directories (folders) per drive to simulate more disks in Kudu;
I thought this was a good choice since we have an all-flash configuration.
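
For reference, a single tserver with that kind of layout is started with roughly the
flags below. This is only an illustrative sketch - the paths, ports and thread count
are placeholders, not our exact config (see the gists Dmitry linked below for the real
setup):

# illustrative only: one tserver pinned to one NVMe drive, three data dirs on it
kudu-tserver \
  --fs_wal_dir=/data/nvme0/kudu/wal \
  --fs_data_dirs=/data/nvme0/kudu/data1,/data/nvme0/kudu/data2,/data/nvme0/kudu/data3 \
  --maintenance_manager_num_threads=4 \
  --rpc_bind_addresses=0.0.0.0:7150 \
  --webserver_port=8150

When several tservers share a VM, each one gets its own WAL/data directories and its
own RPC and webserver ports so they don't collide.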

Manuel

-----Original Message-----
From: Dmitry Degrave [mailto:dmeetry@gmail.com]
Sent: Saturday, July 13, 2019 2:37 PM
To: user@kudu.apache.org
Subject: Re: Underutilization of hardware resources with smaller number of Tservers

Thanks for looking at this, Adar.

I put more details about our ETL process below, but this might not be
important after all. The specifics of our ETL are not significant here. The
important point is that we are able to reproduce the same
effect/difference with Kudu's own synthetic tests. I hope all the
details below do not distract us from this point.

Let me refocus the question: do you see the same performance
difference in synthetic tests between configs with different numbers
of tservers on your clusters?

> Thanks for the detailed summary and analysis. I want to make sure I
> understand the overall cluster topology. You have two physical hosts,
> each running one VM, with each VM running either 4 or 8 tservers. Is
> that correct?

Yes, that's right. In the first test we ran both nodes with 4
tservers; in the second test we ran both nodes with 8 tservers. We
initially started with 3 VMs/physical nodes in the cluster with
similar configs and saw the same performance differences. We don't
have all the logs and graphs from the 3-VM runs, and we later
temporarily had to give up one of the nodes, hence all performance
charts are from the 2-node cluster, but the point stands: the
difference stays the same.

> Here are some other questions:
> - Have you verified that your _client_ machines (i.e. the ETL job
> machine or the machine running loadgen) aren't fully saturated? If it
> is, that would explain why the server machines aren't fully utilized.
> You mention adding more and more tservers until the hardware is
> saturated so I imagine the client machines aren't the bottleneck, but
> it's good to check.

Yes and No.

For our ETL data, the Spark node was fully utilized from time to time
in terms of CPUs, but it wasn't overstressed (load avg < 50 with 56
cores). The way Spark processes the data causes the load on Kudu to
come in waves. Hence, we might not reach the maximum possible
performance on the Kudu side, but this does not explain the perf
differences between configs with different numbers of tservers, as
they experience the same pressure from Spark in both tests.

For the synthetic tests, we didn't reach saturation on the
load-generating node - links below for CPU utilization (in these runs
kudu-1 was the load generator and kudu-2/3 were the Kudu nodes):

load generator:
https://user-images.githubusercontent.com/20377386/61166839-07fa1800-a578-11e9-806f-c56a8e57e506.png

one of kudu nodes:
https://user-images.githubusercontent.com/20377386/61166946-f0bc2a00-a579-11e9-8f87-ce7719503a89.png

As a side note, we are going to repeat the tests with Spark running on
the cluster next week. This is an obvious move with respect to overall
performance, but it is unlikely to change the results with regard to
the topic in question.

> - How is the memory utilization? Are the VMs overcommitted? Is there
> any swapping? I noticed you haven't configured a memory limit for the
> tservers, which means each will try to utilize 80% of the available
> RAM on the machine. That's significant overcommitment when there are
> multiple tservers in each VM.

In the ETL runs, actually used memory stayed below 60GB total per node
- the rest was used by the kernel for page caching (up to 500GB).

In the synthetic tests memory is not a factor at all - they use a tiny
fraction of what's available (500GB).
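
If overcommit ever did become a problem with several tservers in one VM, my
understanding is that each process could be capped explicitly instead of relying on
the default 80% heuristic, e.g. by adding something like this to each tserver's flags
(the value below is just an example, not what we run):

  --memory_limit_hard_bytes=53687091200   # ~50 GiB hard limit per tserver process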

> - The tserver configuration shows three data directories per NVMe. Why
> not just one data directory per NVMe?

Manuel played with this parameter - Manuel, any comments?

WALs are distributed evenly across all NVMe drives.
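
(Each tserver has a single --fs_wal_dir, so "evenly" here means the WAL directories
are spread across the drives; schematically, with placeholder paths:)

# illustrative: 4 tservers on one VM, WAL directories rotated across the 4 NVMe drives
tserver 1: --fs_wal_dir=/data/nvme0/kudu-ts1/wal
tserver 2: --fs_wal_dir=/data/nvme1/kudu-ts2/wal
tserver 3: --fs_wal_dir=/data/nvme2/kudu-ts3/wal
tserver 4: --fs_wal_dir=/data/nvme3/kudu-ts4/wal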

> - The schema is helpful, but it's not clear how many hash buckets are
> in use. I imagine you vary that based on the number of tservers, but
> it'd be good to know the details here.

We kept it constant across the tests. Starting from the global max of
180 and the goal of RF=3 for the production config, we get 60 as the
max. So it was 60 for the GT table (the largest), 30 for the variant
table, and 4 for the sample table. It's worth mentioning that in these
tests we used RF=1.
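
As a side note, for the synthetic loadgen runs the bucket count and replication
factor can be pinned on the auto-created table as well; something along these lines
should work (flag names taken from the kudu CLI's perf loadgen tool - worth
double-checking against kudu perf loadgen --help on 1.9.0):

kudu perf loadgen kudu-1,kudu-3 \
  --num_threads=6 --num_rows_per_thread=50000000 \
  --table_num_hash_partitions=60 --table_num_replicas=1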

> On an unrelated note, you mentioned using one master but the
> configurations show a cluster with two. That's probably not related to
> the performance issue, but you should know that a 2-master
> configuration can't tolerate any faults; you should use 3 masters if
> your goal is to tolerate the loss of a master.

For the tests we started with 3 VMs / 3 physical hosts, with 1 master
per VM. Later we ran the tests with just 2 VMs, still 1 master per VM
(one physical node was out of the game due to hardware problems). For
production we are planning to use 3-6 physical nodes, 1 VM per node.
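
A 3-master quorum there would just mean each kudu-master is started with the full
master list, roughly like this (hostnames and paths are placeholders):

kudu-master \
  --master_addresses=kudu-master-1:7051,kudu-master-2:7051,kudu-master-3:7051 \
  --fs_wal_dir=/data/nvme0/kudu-master/wal \
  --fs_data_dirs=/data/nvme0/kudu-master/data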

thanks,
dmitry

>
>
> On Fri, Jul 12, 2019 at 9:19 AM Dmitry Degrave <dmeetry@gmail.com> wrote:
> >
> > Hi All,
> >
> > We have a small 1.9.0 Kudu cluster for pre-production testing with
> > rather powerful nodes. Each node has:
> >
> > 28 cores with HT (total 56 logical cores)
> > Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
> >
> > RAM: 503GiB
> > 10 NVMe drives
> >
> > We run OpenStack VMs on top of them, 1 VM per physical host. Each VM
> > has 50 cores, 500GiB RAM and pci passthrough access to 4 NVMe drives.
> >
> > It is expected to see the same performance with a single tserver per
> > VM compared with multiple tservers per VM (given that the total number
> > of MM threads per VM is equal) - based on a discussion we had on the
> > getkudu/kudu-general channel.
> >
> > This is not what we have actually seen over the last months. Based
> > on our experience, more tservers utilize hardware resources better and
> > give better overall ETL performance than fewer tservers.
> >
> > We are trying to understand the reasons.
> >
> > To demonstrate the difference, below are results and configuration for
> > running the cluster on 2 VMs, each on a separate physical host, with 4
> > and 8 tservers per VM and 17 maintenance manager threads per VM in
> > total. We run a single master per VM in both cases with an identical config.
> >
> > We see the same pattern with more VMs in the cluster and with different
> > numbers of tservers per VM (keeping the total number of maintenance
> > manager threads per VM equal) - the more tservers per VM, the better
> > the performance (up to a certain limit where we reach h/w capacity).
> >
> > Starting 4 tservers per VM:
> > https://gist.github.com/dnafault/96bcac24974ea4e50384ecd22161510d
> >
> > Starting 8 tservers per VM:
> > https://gist.github.com/dnafault/a16ae893b881b44d95378b5e4a056862
> >
> > Starting 1 master per VM:
> > https://gist.github.com/dnafault/87ec3565e36b0a88568f8b86f37f9521
> >
> > In both cases we start the ETL from a Spark node on a separate physical
> > host with hardware resources identical to the above. The ETL finishes in
> > 910 min with 4 tservers and in 830 min with 8 tservers.
> >
> > Looking at how hardware is utilized, we can see that all resources are
> > utilized less with 4 tservers:
> >
> > - CPUs (50 cores on VM):
> >     4 tservers load avg:
> > https://user-images.githubusercontent.com/20377386/60939960-21ebee80-a31d-11e9-911a-1e5cc2abcb26.png
> >     8 tservers load avg:
> > https://user-images.githubusercontent.com/20377386/60940404-b3a82b80-a31e-11e9-8ce7-2f843259bf91.png
> >
> > - Network:
> >     4 tservers, kb:
> > https://user-images.githubusercontent.com/20377386/60940071-96bf2880-a31d-11e9-8aa6-6decaa3f651a.png
> >     8 tservers, kb:
> > https://user-images.githubusercontent.com/20377386/60940503-0bdf2d80-a31f-11e9-85de-f5e4b0959c1b.png
> >
> >     4 tservers, pkts:
> > https://user-images.githubusercontent.com/20377386/60940063-92930b00-a31d-11e9-81cf-b7e9f05e53ec.png
> >     8 tservers, pkts:
> > https://user-images.githubusercontent.com/20377386/60940441-d0dcfa00-a31e-11e9-8c03-330024339a34.png
> >
> > - NVMe drives:
> >     4 tservers:
> > https://user-images.githubusercontent.com/20377386/60940249-2238b980-a31e-11e9-8fff-2233ce18f74d.png
> >     8 tservers:
> > https://user-images.githubusercontent.com/20377386/60940534-2e714680-a31f-11e9-956e-76c175de6bfa.png
> >
> > Schema in our tests:
> > https://gist.github.com/dnafault/e55ea987c55d2960c738d94e4811d043
> >
> > Variant table has 9*10^7 rows, GT table has 1.4*10^10 rows.
> >
> > Manuel reproduced the same pattern with synthetic tests on the same
> > cluster (all configs use 17 MM threads in total per VM). The test load
> > is generated from an external VM (no Kudu node on it) with:
> >
> > seq 20 | xargs -L1 -I{} -P4 time ./kudu perf loadgen kudu-1,kudu-3
> > --num_rows_per_thread=50000000 --num_threads=6
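> >
> > (i.e. 20 loadgen invocations, at most 4 running concurrently, each one
> > inserting 6 threads x 5*10^7 rows = 3*10^8 rows)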
> >
> > Time to finish tests:
> >
> >     4 tservers: 18 min 11 sec
> >     8 tservers: 16 min  4 sec
> >    16 tservers: 15 min 23 sec
> >
> > We want to understand which factors prevent a smaller number of
> > tservers from giving similar performance to a larger number of
> > tservers. Is there anything we are missing in the configuration?
> >
> > thanks,
> > dmitry