From: Todd Lipcon
Date: Tue, 13 Mar 2018 08:42:40 -0700
Subject: Re: Follow-up for "Kudu cluster performance cannot grow up with machines added"
To: user@kudu.apache.org
On Mon, Mar 12, 2018 at 7:08 PM, 张晓宁 (Zhang Xiaoning) wrote:

> To your and Brock's questions, my answers are as below.
>
> What client are you using to benchmark? You might also be bound by the client performance.
>
> My Answer: We are using many stress-testing machines to test the highest TPS on Kudu. Our testing client should have enough capacity.

But, specifically, what client? Is it something you built directly using the Java client? The C++ client? How many threads are you using? Which flush mode are you using to write? What buffer sizes are you using?

> I'd verify that the new nodes are assigned tablets, along with considering an increase in the number of partitions on the table being tested.
>
> My Answer: Yes, with machines added each time, I created a new table for testing so that tablets can be assigned to the new machines. For the partition strategy, I am using two-level partitioning: the first level is a range partition by date (I use 3 partitions here, meaning 3 days of data), and the second level is a hash partition (I use 3, 6, and 9 buckets respectively for the clusters with 3, 6, and 9 tservers).

Did you delete the original table and wait some time before creating the new table? Otherwise, you will see a skewed distribution where the new table will have most of its replicas placed on the new empty machines.
For example:

1) With 6 servers, create a table with 18 partitions -- replicas will be spread evenly across those 6 nodes (probably 9 each).
2) Add 3 empty servers, then create a new table with 27 partitions -- the new table will probably have about 18 partitions on the new nodes and 3 on the existing nodes (a 6:1 skew).
3) Do the same again -- the new table will likely have most of its partitions on those 3 empty nodes again.

Of course, with skew like that, you'll probably see that those new tables do not perform well, since most of the work would fall on a smaller subset of nodes. If you delete the tables in between the steps, you should see a more even distribution.

Another possibility you may be hitting is that our buffering in the clients is currently cluster-wide. In other words, each time you apply an operation, the client checks whether the total buffer limit has been reached, and if it has, it flushes the pending writes to all tablets. Only once all of those writes are complete is the batch considered "completed", freeing up space for the next batch of writes to be buffered. This means that, as the number of tablets and tablet servers grows, the completion time for the batch is increasingly dominated by the high-percentile latencies of the writes rather than the average, causing per-client throughput to drop.

This is tracked by KUDU-1693. I believe there was another related JIRA somewhere as well, but I can't seem to find it. Unfortunately, fixing it is not straightforward, though it would have good impact for these cases where a single writer is fanning out to tens or hundreds of tablets.

-Todd

--
Todd Lipcon
Software Engineer, Cloudera
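The placement skew in the numbered example above can be illustrated with a toy model. The sketch below is a deliberate simplification (a greedy least-loaded placement rather than Kudu's actual placement policy, with hypothetical node names), but it shows why a table created right after adding empty servers lands disproportionately on those servers:

```python
from collections import Counter

def place_table(num_partitions, replication, loads):
    """Greedily place each partition's replicas on the least-loaded
    distinct nodes, mutating `loads`. Returns per-node replica counts
    for the new table only. A toy stand-in for a real placement policy."""
    new_counts = Counter()
    for _ in range(num_partitions):
        # Pick `replication` distinct nodes with the lowest current load.
        targets = sorted(loads, key=lambda n: (loads[n], n))[:replication]
        for node in targets:
            loads[node] += 1
            new_counts[node] += 1
    return new_counts

# 6 original servers, each already holding 9 replicas from the first
# table (18 partitions x 3 replicas / 6 nodes), plus 3 empty servers.
loads = {f"old{i}": 9 for i in range(6)}
loads.update({f"new{i}": 0 for i in range(3)})

counts = place_table(num_partitions=27, replication=3, loads=loads)
old_total = sum(c for n, c in counts.items() if n.startswith("old"))
new_total = sum(c for n, c in counts.items() if n.startswith("new"))
print(old_total, new_total)  # -> 36 45
```

Under this model the 3 empty nodes absorb 45 of the 81 new replicas (15 each) versus 6 each on the 6 original nodes. Deleting the old table and waiting first would let placement start from roughly even loads.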
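The cluster-wide buffering effect can also be made concrete with a little arithmetic. Assuming, purely for illustration, that each per-tablet flush RPC has an exponentially distributed latency with a 10 ms mean, the expected time for the slowest of n parallel RPCs is 10·H_n ms, where H_n is the n-th harmonic number. So batch completion time, and hence per-client throughput, is governed by tail latency as the fan-out grows (the buffer size and latency below are hypothetical numbers, not Kudu defaults):

```python
def harmonic(n):
    """H_n = 1 + 1/2 + ... + 1/n: expected max of n iid Exp(1) variables."""
    return sum(1.0 / k for k in range(1, n + 1))

MEAN_RPC_MS = 10.0   # illustrative mean per-tablet flush latency
BUFFER_OPS = 1000    # illustrative client buffer size, in operations

def batch_completion_ms(num_tablets):
    # The batch finishes only when the slowest tablet's flush returns.
    return MEAN_RPC_MS * harmonic(num_tablets)

def throughput_ops_per_sec(num_tablets):
    # One full buffer is flushed per completed batch.
    return BUFFER_OPS / batch_completion_ms(num_tablets) * 1000.0

for n in (1, 10, 100):
    print(n, round(batch_completion_ms(n), 1), round(throughput_ops_per_sec(n)))
```

Growing the fan-out from 1 tablet to 100 roughly quintuples the expected batch completion time (10 ms to about 52 ms) in this model, cutting per-client throughput by the same factor even though each individual server is no slower.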