Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 05D6A200ABD for ; Sat, 14 May 2016 14:39:02 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 0437B160969; Sat, 14 May 2016 12:39:02 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id CA0E4160131 for ; Sat, 14 May 2016 14:39:00 +0200 (CEST) Received: (qmail 70153 invoked by uid 500); 14 May 2016 12:38:59 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 70143 invoked by uid 99); 14 May 2016 12:38:59 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd2-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 14 May 2016 12:38:59 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd2-us-west.apache.org (ASF Mail Server at spamd2-us-west.apache.org) with ESMTP id 449721A0435 for ; Sat, 14 May 2016 12:38:59 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd2-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 1.429 X-Spam-Level: * X-Spam-Status: No, score=1.429 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_ENVFROM_END_DIGIT=0.25, HTML_MESSAGE=2, RCVD_IN_DNSWL_LOW=-0.7, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, SPF_PASS=-0.001] autolearn=disabled Authentication-Results: spamd2-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx2-lw-us.apache.org ([10.40.0.8]) by localhost (spamd2-us-west.apache.org [10.40.0.9]) (amavisd-new, port 10024) with ESMTP id 8o1fpzEwbHRy for ; Sat, 14 May 2016 12:38:57 +0000 (UTC) Received: from mail-oi0-f53.google.com (mail-oi0-f53.google.com [209.85.218.53]) by mx2-lw-us.apache.org (ASF Mail Server at mx2-lw-us.apache.org) with ESMTPS id 7691C5F477 for ; Sat, 14 May 2016 12:38:56 +0000 (UTC) Received: by mail-oi0-f53.google.com with SMTP id v145so208603518oie.0 for ; Sat, 14 May 2016 05:38:56 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to; bh=U60lNwPzzhPdpJ8GjHfW7egogDTUyI5zBusVKtrNE7E=; b=hzGlHng3d9pvyO82D47yCjE3bFFEk/NnDm5v5rWSQ23LTKKZFjmlSCL1isksPN0157 Lh7cIBIfn2Rn+JqYUe6Iu8LxnnFlMGTXGWKkGYhhrcIJsmXXyCchwzQNQ4IovawuvqfW a2Qekxfdc+LtnXIj4BHaUAN2UNYKcZ+FuY/TF0MvkkfFk6hPDBlUAji1baiX0CgSI7eB 1+QrXe/hEo9cmNpuaX9dylioTqGLTVqVbjx1irN7sfCX46Yq601X7McqXVq4Kcmc7qjs UQyXDvgFgVra4e6qOVpiAsosZWT33hMDaVpBOjGyYKGe4RRQNl4HwJtxxVQuBd49SKnC bbpg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:to; bh=U60lNwPzzhPdpJ8GjHfW7egogDTUyI5zBusVKtrNE7E=; b=erl04PMyejPNJC7JIUvBSaOdTW+9hEZA9YC60YbA7nrJnQD8kqw2LZb5/m3BhTyEWL mM7u5LnxJOgKsDuIf6Rg3P9j6XA1q5lqsyqHx9xaiS9RCmCxP44bfKW7v2DlOiHxnvq+ ykCrZInYG7VQpHboNocAnHBnk/s0Pfx4Du4R2AKezXyM4whcLSzszdfPvRDe2/2b5C/B f/CEsg3k9I4YmAOlJcfDZfJ+TD9Vru6GTAxMZVSeynGfrvJuHbsudGcbhcb9oug1o6pd 3WZ/PXi1BLqjuOZw+cj0GoDcd1l622NEdkQPp31r+/v23HWLjbP5VIOaenvMyVHFoPYb F3Lw== X-Gm-Message-State: AOPr4FW7M5vneTA9tETciBMtMsFLDxCm6LYON/P13yztqY/k4oQ5pnKhkXp4liDvR6axBIUZkQUbkJybUkQnAA== MIME-Version: 1.0 X-Received: by 10.157.49.118 with SMTP id v51mr12076014otd.97.1463229535518; Sat, 14 May 2016 05:38:55 -0700 (PDT) Received: by 10.157.63.74 with HTTP; Sat, 14 May 2016 05:38:55 -0700 (PDT) Received: by 10.157.63.74 with HTTP; Sat, 14 May 2016 05:38:55 -0700 (PDT) In-Reply-To: References: Date: Sat, 14 May 2016 21:38:55 +0900 Message-ID: Subject: Re: clustered bucket and tablesample From: no jihun To: user@hive.apache.org Content-Type: multipart/alternative; boundary=001a113d05f0b08ba40532ccac89 archived-at: Sat, 14 May 2016 12:39:02 -0000 --001a113d05f0b08ba40532ccac89 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable ah, as i mentioned both field type of action_id and classifier is STRING. and I can not change the type. CREATE TABLE `X`(`action_id` string,`classifier` string) CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS STORED AS ORC I use two fields for hash then bucketing because each one field is not so well distributed. my concern is not about the strong hash source but about How can I tablesample to the a bucket by field value what provided by 'where clause' when I clustered by string fields which one is right for tablesample? 1. provide fields TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier) 2. provide values TABLESAMPLE(BUCKET 1 OUT OF 256 ON 'aaa', 'bbb') 2016. 5. 14. =EC=98=A4=ED=9B=84 8:48=EC=97=90 "Mich Talebzadeh" =EB=8B=98=EC=9D=B4 =EC=9E=91=EC=84=B1: > Is action_id can be created as a numeric column: > > CREATE TABLE X ( action_id bigint, ..) > > Bucketing or hash partitioning best works on numeric columns with high > cardinality (say a primary key). > > From my old notes: > > Bucketing in Hive refers to hash partitioning where a hashing function is > applied. Likewise an RDBMS like Oracle, Hive will apply a linear hashing > algorithm to prevent data from clustering within specific partitions. > Hashing is very effective if the column selected for bucketing has very > high selectivity like an ID column where selectivity (select > count(distinct(column))/count(column) ) =3D 1. In this case, the created > partitions/ files will be as evenly sized as possible. In a nutshell > bucketing is a method to get data evenly distributed over many > partitions/files. One should define the number of buckets by a power of > two -- 2^n, like 2, 4, 8, 16 etc to achieve best results. Again bucketin= g > will help concurrency in Hive. It may even allow a partition wise join i.= e. > a join between two tables that are bucketed on the same column with the > same number of buckets (anyone has tried this?) > > > > One more things. When one defines the number of buckets at table creation > level in Hive, the number of partitions/files will be fixed. In contrast, > with partitioning you do not have this limitation. > > can you do > > show create table X > > and send the output. please. > > > > Thanks > > > Dr Mich Talebzadeh > > > > LinkedIn * https://www.linkedin.com/profile/view?id=3DAAEAAAAWh2gBxianrbJ= d6zP6AcPCCdOABUrV8Pw > * > > > > http://talebzadehmich.wordpress.com > > > > On 14 May 2016 at 12:23, no jihun wrote: > >> Hello. >> >> I want to ask the correct bucketing and tablesample way. >> >> There is a table X which I created by >> >> CREATE TABLE `X`(`action_id` string,`classifier` string) >> CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS >> STORED AS ORC >> >> Then I inserted 500M of rows into X by >> >> set hive.enforce.bucketing=3Dtrue; >> INSERT OVERWRITE INTO X SELECT * FROM X_RAW >> >> Then I want to count or search some rows with condition. roughly, >> >> SELECT COUNT(*) FROM X WHERE action_id=3D'aaa' AND classifier=3D'bbb' >> >> But I'd better to USE tablesample as I clustered X (action_id, >> classifier). So, the better query will be >> >> SELECT COUNT(*) FROM X >> TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier) >> WHERE action_id=3D'aaa' AND classifier=3D'bbb' >> >> Is there any wrong above? But I can't not find any performance gain >> between these two query. >> >> query1 and RESULT( with no tablesample.) >> >> SELECT COUNT(*)) from X >> WHERE action_id=3D'aaa' and classifier=3D'bbb' >> >> >> ------------------------------------------------------------------------= -------- >> VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED >> KILLED >> >> ------------------------------------------------------------------------= -------- >> Map 1 .......... SUCCEEDED 256 256 0 0 >> 0 0 >> Reducer 2 ...... SUCCEEDED 1 1 0 0 >> 0 0 >> >> ------------------------------------------------------------------------= -------- >> VERTICES: 02/02 [=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D>>] 100% ELAPSED TIME: 15.35 >> s >> >> ------------------------------------------------------------------------= -------- >> It scans full data. >> >> query 2 and RESULT >> >> SELECT COUNT(*)) from X >> TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier) >> WHERE action_id=3D'aaa' and classifier=3D'bbb' >> >> >> ------------------------------------------------------------------------= -------- >> VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED >> KILLED >> >> ------------------------------------------------------------------------= -------- >> Map 1 .......... SUCCEEDED 256 256 0 0 >> 0 0 >> Reducer 2 ...... SUCCEEDED 1 1 0 0 >> 0 0 >> >> ------------------------------------------------------------------------= -------- >> VERTICES: 02/02 [=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D>>] 100% ELAPSED TIME: >> 15.82 s >> >> ------------------------------------------------------------------------= -------- >> It ALSO scans full data. >> >> query 2 RESULT WHAT I EXPECTED. >> >> Result what I expected is something like... >> (use 1 map and relatively faster than without tabmesample) >> >> ------------------------------------------------------------------------= -------- >> VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED >> KILLED >> >> ------------------------------------------------------------------------= -------- >> Map 1 .......... SUCCEEDED 1 1 0 0 >> 0 0 >> Reducer 2 ...... SUCCEEDED 1 1 0 0 >> 0 0 >> >> ------------------------------------------------------------------------= -------- >> VERTICES: 02/02 [=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D>>] 100% ELAPSED TIME: >> 3.xx s >> >> ------------------------------------------------------------------------= -------- >> >> Values of action_id and classifier are well distributed and there is no >> skewed data. >> >> So I want to ask you what will be a correct query that prune and target >> specific bucket by multiple column? >> > > --001a113d05f0b08ba40532ccac89 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable

ah, as i mentioned
both field type of action_id and classifier is STRING. and I can not change= the type.

CREATE TABLE `X`(`action_id` string,`classifier` string)
CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
STORED AS ORC

I use two fields for hash then bucketing because each one fi= eld is not so well distributed.

my concern is not about the strong hash source but about How= can I tablesample to the a bucket by field value what provided by 'whe= re clause'

when I clustered by string fields which one is right for tab= lesample?
1. provide fields
TABLESAMPLE(BUCKET 1 OUT OF 256 ON=C2=A0 action_id, classifier)

2. provide values
TABLESAMPLE(BUCKET 1 OUT OF 256 ON=C2=A0 'aaa', 'bbb')

2016. 5. 14. =EC=98=A4=ED=9B=84 8:48=EC=97=90 &q= uot;Mich Talebzadeh" <= mich.talebzadeh@gmail.com>=EB=8B=98=EC=9D=B4 =EC=9E=91=EC=84=B1:
I= s action_id can be created as a numeric column:

CR= EATE TABLE X (=C2=A0action_id bigint,=C2=A0 ..)

Bu= cketing or hash partitioning best works on numeric columns with high cardin= ality (say a primary key).

From my old notes:

= Bucketing in Hive refers to hash partitioning where a hashing function is applied. Likewise an RDBMS like Oracle, Hive will apply= a linear hashing algorithm to prevent data from clustering within specific partitions. Hashing is very effective if the column selected for bucketing = has very high selectivity like an ID column where selectivity (select count(distinct(column))/count(column) ) =3D 1.=C2=A0 In this case, the crea= ted partitions/ files will be as evenly sized as possible. In a nutshell bucket= ing is a method to get data evenly distributed over many partitions/files.=C2= =A0 One should define the number of buckets by a power of two -- 2^n,=C2=A0 lik= e 2, 4, 8, 16 etc to achieve best results. Again bucketing will help concurrency= in Hive. It may even allow a partition wise join i.e. a join between two table= s that are bucketed on the same column with the same number of buckets (anyon= e has tried this?)

= =C2=A0

= One more things. When one defines the number of buckets at table creation level in Hive, the number of partitions/files wil= l be fixed. In contrast, with partitioning you do not have this limitation.


can you do

show = create table X

and send the output. please.
=



Thanks



On 14 May 2016 at 12:23, no jihun <jeesim2= @gmail.com> wrote:

Hello.

I want to ask the correct bucketing and tablesample way.

There is a table X which I created by

CREATE TABLE `X`(`action_id` string,`classifier` string)
CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
STORED AS ORC

Then I inserted 500M of rows into X by

set hive.enforce.bucketing=3Dtrue;
INSERT OVERWRITE INTO X SELECT * FROM X_RAW

Then I want to count or search some rows with condition. rou= ghly,

SELECT COUNT(*) FROM X WHERE action_id=3D'aaa' AND c= lassifier=3D'bbb'

But I'd better to USE tablesample as I clustered X (acti= on_id, classifier). So, the better query will be

SELECT COUNT(*) FROM X
TABLESAMPLE(BUCKET 1 OUT OF 256 ON=C2=A0 action_id, classifier)
WHERE action_id=3D'aaa' AND classifier=3D'bbb'

Is there any wrong above? But I can't not find any perfo= rmance gain between these two query.

query1 and RESULT( with no tablesample.)

SELECT COUNT(*)) from X
WHERE action_id=3D'aaa' and classifier=3D'bbb'

------------------------------------------------------------= --------------------
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 VERTICES=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0 STATUS=C2=A0 TOTAL=C2=A0 COMPLETED=C2=A0 RUNNING=C2=A0 PENDING=C2=A0= FAILED=C2=A0 KILLED
---------------------------------------------------------------------------= -----
Map 1 ..........=C2=A0=C2=A0 SUCCEEDED=C2=A0=C2=A0=C2=A0 256=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0 256=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0 0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0
Reducer 2 ......=C2=A0=C2=A0 SUCCEEDED=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 1=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 1=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0
---------------------------------------------------------------------------= -----
VERTICES: 02/02=C2=A0 [=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D>>] 100%=C2=A0 ELAPSED TIME: 15.35 s=C2=A0= =C2=A0=C2=A0
---------------------------------------------------------------------------= -----
It scans full data.

query 2 and RESULT

SELECT COUNT(*)) from X
TABLESAMPLE(BUCKET 1 OUT OF 256 ON=C2=A0 action_id, classifier)
WHERE action_id=3D'aaa' and classifier=3D'bbb'

------------------------------------------------------------= --------------------
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 VERTICES=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0 STATUS=C2=A0 TOTAL=C2=A0 COMPLETED=C2=A0 RUNNING=C2=A0 PENDING=C2=A0= FAILED=C2=A0 KILLED
---------------------------------------------------------------------------= -----
Map 1 ..........=C2=A0=C2=A0 SUCCEEDED=C2=A0=C2=A0=C2=A0 256=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0 256=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0 0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0
Reducer 2 ......=C2=A0=C2=A0 SUCCEEDED=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 1=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 1=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0
---------------------------------------------------------------------------= -----
VERTICES: 02/02=C2=A0 [=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D>>] 100%=C2=A0 ELAPSED TIME: 15.82=C2=A0= =C2=A0=C2=A0=C2=A0 s=C2=A0=C2=A0=C2=A0
---------------------------------------------------------------------------= -----
It ALSO scans full data.

query 2 RESULT WHAT I EXPECTED.

Result what I expected is something like...
(use 1 map and relatively faster than without tabmesample)
---------------------------------------------------------------------------= -----
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 VERTICES=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0 STATUS=C2=A0 TOTAL=C2=A0 COMPLETED=C2=A0 RUNNING=C2=A0 PENDING=C2=A0= FAILED=C2=A0 KILLED
---------------------------------------------------------------------------= -----
Map 1 ..........=C2=A0=C2=A0 SUCCEEDED=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 1=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 1=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0
Reducer 2 ......=C2=A0=C2=A0 SUCCEEDED=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 1=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 1=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0
---------------------------------------------------------------------------= -----
VERTICES: 02/02=C2=A0 [=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D>>] 100%=C2=A0 ELAPSED TIME: 3.xx=C2=A0=C2= =A0=C2=A0=C2=A0 s=C2=A0=C2=A0=C2=A0
---------------------------------------------------------------------------= -----

Values of action_id and classifier are well distributed and = there is no skewed data.

So I want to ask you what will be a correct query that prune= and target specific bucket by multiple column?


--001a113d05f0b08ba40532ccac89--