MIME-Version: 1.0
In-Reply-To: <BN3PR0601MB138405C933A2F66BA2DDEE2E9F9F0@BN3PR0601MB1384.namprd06.prod.outlook.com>
References: <BN3PR0601MB138405C933A2F66BA2DDEE2E9F9F0@BN3PR0601MB1384.namprd06.prod.outlook.com>
From: Todd Lipcon <todd@cloudera.com>
Date: Fri, 11 May 2018 12:48:00 -0700
Message-ID: <CADY20s5C17N83Obi8rMA1e3L7K7KmeXih0NcMwq+SibMmz9uBw@mail.gmail.com>
Subject: Re: Kudu read - performance issue
To: user@kudu.apache.org
Content-Type: multipart/alternative; boundary="00000000000014906b056bf36bae"

--00000000000014906b056bf36bae
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Fri, May 11, 2018 at 12:05 PM, Todor Petrov <Todor.Petrov@rms.com> wrote:

> Hi there,
>
>
>
> I have an interesting performance issue reading from Kudu. Hopefully there
> is a good explanation for it because the difference in the performance is
> quite significant and it puzzles me a lot.
>
>
>
> Basically we have a table with the following schema:
>
>
>
> Column1, int32 NOT NULL, BIT_SHUFFLE, NO_COMPRESSION
> Column2, int32 NOT NULL, BIT_SHUFFLE, NO_COMPRESSION
> … (a bunch of int32 and int16 columns)
>
> PK is (Column1, Column2)
> HASH(Column1) PARTITIONS 4
>
>
>
> The number of records is *~60M*. *~5K* distinct Column1 values. *~1.4M*
> distinct values for Column2.
>
>
>
> All tests are made on one core. I think the hardware specs are not
> important.
>
>
>
> 1)      If we query all data using
>
>
>
>       val scanner =
>         kuduClient.getAsyncScannerBuilder(table)
>           .addPredicate(KuduPredicate.newComparisonPredicate(Column1Schema,
>             ComparisonOp.EQUAL, column1Value)).build()
>
>
>
> We use 3 scanners in parallel (one query for each unique value of
> column1).
>
>
>
> All fields from the returned rows are read and some internal structures
> are built.
>
>
>
> In this case, it takes *~40 sec* to load all the data.
>
>
>
> 2)      If we query using “InListPredicate”, then the performance is
> super slow.
>
>
>
>       val scanner =
>         kuduClient.getAsyncScannerBuilder(table)
>           .addPredicate(KuduPredicate.newComparisonPredicate(Column1Schema,
>             ComparisonOp.EQUAL, column1Value))
>           .addPredicate(KuduPredicate.newInListPredicate(Column2Schema,
>             column2Values.asJava)).build()
>
>
>
> Same as in 1), 3 scanners in parallel, all records are read and some
> in-memory structures are built. This time column2 values are split into a
> bunch of chunks and we send a request for each unique value of column1 and
> each chunk of column2 values.
>

Are you sorting the values of 'column2' before doing the chunking? Kudu
doesn't use indexes for evaluating IN-list predicates except for using the
min(in-list-values) and max(in-list-values). So, if you had for example:

pre-chunk in-list: 1,2,3,4,5,6
chunk 1: col2 IN (1,6)
chunk 2: col2 IN (2,5)
chunk 3: col2 IN (3,4)

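then you will actually scan over the middle portion of that table 3 times:
each chunk is scanned from its min to its max, so the three scans cover
1..6, 2..5 and 3..4, and the rows with col2 between 3 and 4 are read by all
three.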

If you sort the in-list before chunking you'll avoid the multiple-scan
effect here.
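
As a rough illustration of sorting before chunking, here is a minimal Scala
sketch. InListChunks, sortedChunks and scanRange are hypothetical helper names
made up for this example, not part of the Kudu client, and scanRange only
models the min/max behaviour described above; it is an assumption about how
the scan is bounded, not Kudu code.

```scala
// Hypothetical helpers for illustration; not part of the Kudu client API.
object InListChunks {
  // Sort (and de-duplicate) the IN-list values before splitting them, so each
  // chunk covers a contiguous range that does not overlap any other chunk.
  def sortedChunks(values: Seq[Int], chunkSize: Int): List[List[Int]] =
    values.distinct.sorted.toList.grouped(chunkSize).toList

  // Model of the [min, max] range scanned for one chunk, assuming only the
  // smallest and largest in-list values bound the scan.
  def scanRange(chunk: Seq[Int]): (Int, Int) = (chunk.min, chunk.max)

  def main(args: Array[String]): Unit = {
    // Chunks from the example above: ranges 1..6, 2..5 and 3..4 all overlap.
    val unsorted = List(List(1, 6), List(2, 5), List(3, 4))
    println(unsorted.map(scanRange))

    // Sorted first: ranges 1..2, 3..4 and 5..6 are disjoint.
    println(sortedChunks(Seq(1, 6, 2, 5, 3, 4), 2).map(scanRange))
  }
}
```

Each sorted chunk could then be passed to KuduPredicate.newInListPredicate in
the same way as column2Values.asJava in your snippet.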

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera

--00000000000014906b056bf36bae--