Mailing-List: contact user-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hive.apache.org
MIME-Version: 1.0
In-Reply-To: <CAKZg861j+SFw-Hhi0OGYVY_njauOpYths5yzvdhpihzMDcZ3Yg@mail.gmail.com>
References: <CAKZg861j+SFw-Hhi0OGYVY_njauOpYths5yzvdhpihzMDcZ3Yg@mail.gmail.com>
Date: Sat, 14 May 2016 12:48:25 +0100
Message-ID: <CAJ3fcbA-EmWMwWV58XT_HdZEDgojfp5BRQTcWBGcXKhdEDj5rg@mail.gmail.com>
Subject: Re: clustered bucket and tablesample
From: Mich Talebzadeh <mich.talebzadeh@gmail.com>
To: user <user@hive.apache.org>
Content-Type: multipart/alternative; boundary=94eb2c055c36104e560532cbf81b
archived-at: Sat, 14 May 2016 11:48:38 -0000

--94eb2c055c36104e560532cbf81b
Content-Type: text/plain; charset=UTF-8

Can action_id be created as a numeric column instead? For example:

CREATE TABLE X ( action_id bigint,  ..)

Bucketing (hash partitioning) works best on numeric columns with high
cardinality (say, a primary key).
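
To make the cardinality point concrete, here is a rough sketch in Python. It
assumes a plain hash-mod assignment (mask the key to a non-negative integer,
then take the remainder); that scheme, the bucket count and all names below
are illustrative assumptions, not Hive's actual implementation.

```python
from collections import Counter

NUM_BUCKETS = 256  # hypothetical bucket count, a power of two


def bucket_for(key: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Assumed hash-mod scheme: mask to a non-negative int, then take the remainder."""
    return (key & 0x7FFFFFFF) % num_buckets


def bucket_sizes(keys, num_buckets: int = NUM_BUCKETS):
    """Return the number of rows landing in each bucket."""
    counts = Counter(bucket_for(k, num_buckets) for k in keys)
    return [counts.get(b, 0) for b in range(num_buckets)]


def distinct_ratio(keys) -> float:
    """count(distinct key) / count(key); 1.0 means every row has its own value."""
    keys = list(keys)
    return len(set(keys)) / len(keys)


# High cardinality: 100,000 distinct ids, like a primary key.
unique_ids = list(range(100_000))
high = bucket_sizes(unique_ids)

# Low cardinality: the same number of rows but only 3 distinct values.
few_values = [i % 3 for i in range(100_000)]
low = bucket_sizes(few_values)

print(distinct_ratio(unique_ids), max(high) - min(high))
print(distinct_ratio(few_values), sum(1 for n in low if n > 0))
```

With unique ids the bucket sizes differ by at most one row; with only three
distinct values, only three of the 256 buckets receive any data.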

From my old notes:

Bucketing in Hive refers to hash partitioning, where a hashing function is
applied to the bucketing column. Like an RDBMS such as Oracle, Hive will
apply a linear hashing algorithm to prevent data from clustering within
specific partitions. Hashing is very effective if the column selected for
bucketing has very high selectivity, like an ID column where selectivity
(select count(distinct(column))/count(column)) = 1. In this case, the
created partitions/files will be as evenly sized as possible. In a nutshell,
bucketing is a method to get data evenly distributed over many
partitions/files. One should define the number of buckets as a power of
two (2^n), like 2, 4, 8, 16 etc., to achieve the best results. Bucketing
will also help concurrency in Hive. It may even allow a partition-wise join,
i.e. a join between two tables that are bucketed on the same column with
the same number of buckets (has anyone tried this?)
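
As a rough illustration of the partition-wise join idea, the Python sketch
below buckets two small in-memory tables on the same key with the same bucket
count, then joins bucket i of one table only against bucket i of the other.
The hash-mod assignment, the table contents and the function names are all
hypothetical; this shows the principle, not how Hive executes such a join.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical; both tables must use the same count


def bucketize(rows, num_buckets=NUM_BUCKETS):
    """Split (key, value) rows into buckets by key modulo the bucket count (assumed scheme)."""
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[key % num_buckets].append((key, value))
    return buckets


def bucket_wise_join(left, right, num_buckets=NUM_BUCKETS):
    """Join bucket i of left only with bucket i of right."""
    left_b = bucketize(left, num_buckets)
    right_b = bucketize(right, num_buckets)
    joined = []
    for b in range(num_buckets):
        # Build a lookup for this bucket only; other buckets are never touched.
        lookup = defaultdict(list)
        for key, value in right_b.get(b, []):
            lookup[key].append(value)
        for key, value in left_b.get(b, []):
            for other in lookup.get(key, []):
                joined.append((key, value, other))
    return sorted(joined)


# Made-up sample data for the illustration.
orders = [(1, "order-a"), (2, "order-b"), (5, "order-c")]
customers = [(1, "alice"), (5, "bob"), (6, "carol")]
result = bucket_wise_join(orders, customers)
print(result)
```

Because equal keys always land in the same bucket number on both sides, each
bucket pair can be joined independently, which is what makes the join
parallelisable.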


One more thing. When one defines the number of buckets at table creation
time in Hive, the number of partitions/files is fixed. In contrast,
with partitioning you do not have this limitation.
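
One way to see why a fixed bucket count matters: under a hash-mod assignment
(an assumption for this sketch, not a statement about Hive internals),
changing the number of buckets changes the bucket that many existing rows
belong to, so the data would have to be rewritten. The helper name and key
range below are hypothetical.

```python
def moved_fraction(keys, old_buckets: int, new_buckets: int) -> float:
    """Fraction of keys whose bucket number changes when the bucket count changes."""
    moved = sum(1 for k in keys if k % old_buckets != k % new_buckets)
    return moved / len(keys)


keys = list(range(10_000))  # made-up sequential ids

unchanged = moved_fraction(keys, 256, 256)  # same count: nothing moves
doubled = moved_fraction(keys, 256, 512)    # doubling: roughly half the rows move
print(unchanged, doubled)
```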

Can you run

show create table X

and send the output, please?


Thanks


Dr Mich Talebzadeh


LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw


http://talebzadehmich.wordpress.com


On 14 May 2016 at 12:23, no jihun <jeesim2@gmail.com> wrote:

> Hello.
>
> I want to ask about the correct way to use bucketing and TABLESAMPLE.
>
> There is a table X which I created by
>
> CREATE TABLE `X`(`action_id` string,`classifier` string)
> CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
> STORED AS ORC
>
> Then I inserted 500M rows into X by
>
> set hive.enforce.bucketing=true;
> INSERT OVERWRITE TABLE X SELECT * FROM X_RAW
>
> Then I want to count or search for some rows with a condition. Roughly,
>
> SELECT COUNT(*) FROM X WHERE action_id='aaa' AND classifier='bbb'
>
> But I had better use TABLESAMPLE, as I clustered X by (action_id,
> classifier). So the better query would be
>
> SELECT COUNT(*) FROM X
> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
> WHERE action_id='aaa' AND classifier='bbb'
>
> Is there anything wrong above? I cannot find any performance gain
> between these two queries.
>
> Query 1 and result (with no TABLESAMPLE):
>
> SELECT COUNT(*) from X
> WHERE action_id='aaa' and classifier='bbb'
>
>
> --------------------------------------------------------------------------------
>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> --------------------------------------------------------------------------------
> Map 1 ..........   SUCCEEDED    256        256        0        0       0       0
> Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
> --------------------------------------------------------------------------------
> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 15.35 s
> --------------------------------------------------------------------------------
> It scans the full data.
>
> Query 2 and result:
>
> SELECT COUNT(*) from X
> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
> WHERE action_id='aaa' and classifier='bbb'
>
>
> --------------------------------------------------------------------------------
>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> --------------------------------------------------------------------------------
> Map 1 ..........   SUCCEEDED    256        256        0        0       0       0
> Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
> --------------------------------------------------------------------------------
> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 15.82 s
> --------------------------------------------------------------------------------
> It ALSO scans the full data.
>
> Query 2 result that I expected:
>
> The result I expected is something like this
> (uses 1 map and is relatively faster than without TABLESAMPLE):
>
> --------------------------------------------------------------------------------
>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> --------------------------------------------------------------------------------
> Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
> Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
> --------------------------------------------------------------------------------
> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 3.xx s
> --------------------------------------------------------------------------------
>
> Values of action_id and classifier are well distributed and there is no
> skewed data.
>
> So I want to ask: what would be the correct query to prune and target
> a specific bucket by multiple columns?
>

--94eb2c055c36104e560532cbf81b--