Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 00731200ABD for ; Sat, 14 May 2016 13:48:38 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id F305A160969; Sat, 14 May 2016 11:48:37 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id C79F7160131 for ; Sat, 14 May 2016 13:48:36 +0200 (CEST) Received: (qmail 29963 invoked by uid 500); 14 May 2016 11:48:35 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 29953 invoked by uid 99); 14 May 2016 11:48:35 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd2-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 14 May 2016 11:48:35 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd2-us-west.apache.org (ASF Mail Server at spamd2-us-west.apache.org) with ESMTP id 1C8E41A062A for ; Sat, 14 May 2016 11:48:35 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd2-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 1.179 X-Spam-Level: * X-Spam-Status: No, score=1.179 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=2, RCVD_IN_DNSWL_LOW=-0.7, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, SPF_PASS=-0.001] autolearn=disabled Authentication-Results: spamd2-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx2-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd2-us-west.apache.org [10.40.0.9]) (amavisd-new, port 10024) with ESMTP id zSa91S6Hwy3L for ; Sat, 14 May 2016 11:48:32 +0000 (UTC) Received: from mail-qk0-f174.google.com (mail-qk0-f174.google.com [209.85.220.174]) by mx2-lw-eu.apache.org (ASF Mail Server at mx2-lw-eu.apache.org) with ESMTPS id B4CF65F3F4 for ; Sat, 14 May 2016 11:48:31 +0000 (UTC) Received: by mail-qk0-f174.google.com with SMTP id n63so73794236qkf.0 for ; Sat, 14 May 2016 04:48:31 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to; bh=Mmz7SSSXK0Et/0ug60PfhJ4ht+57tr2zt/9IWgF51Z0=; b=U7LMAuLayj9q/2vgQMu+gGA3/WGDAYlF/pQhtvh3ilBE+b0an0YKeU6Ia4+iCkez7W FvlPyGR01pBKzURc17ffVYg6DINmpXoQbKP1IUSr5eR04iVeCNpJEoWM+Sl9BdYYu4AA lNShPsp38I0kDnLflBdnsAwxVENRjWjE0EUN6H9Iv9vR5osb3MKPZb6rVW3Hql9E1cCz F86HjzmoN6sZ5zont1W4eVrdKOI9e3/O3Sji9iU57Q+I8+DAqq6k0WcSNQA1cdZZd+lt RjJxDfzxy86se4qLhSyv3/H0NwDSWWaj/eQ9cYrchWVFrn2IQRP7DTq0dFIoX8GtBfBA BvgA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:to; bh=Mmz7SSSXK0Et/0ug60PfhJ4ht+57tr2zt/9IWgF51Z0=; b=FlfeoBAC9t9UuTFL0vEVPKATvhsej0eeg3znBCn7hvJc/+sqQfFfZ0n1jsPUzr16KZ yhAT8+ipnFQJy+n4/NrD8IHpb+XLjtY1xcV3CryotOpfHOXFHQAKrZpnPCW4c0NFWrta +JtmWNFhZ+rv70Q1EG9mCGqVCrtHiEq1vUdXc2TLmglJmOkCYdVUvUle6NU/L39KOeRP kUQYeWaVNg4+JarmLSQ3f6B9WBzWSYmaXwU7z/VevQMliAavMVAT++6dYvON5lL7Y5w2 MRx2TwoUJfoj4hYkxNQgU8DpxZzVuyL+DfR9CC1+nCHFof2Xvl+HnPo1FC3AKXZ9JoyX C2lA== X-Gm-Message-State: AOPr4FWaMqOf9s8AslYG7pPWMrXmTdiJuqOp0RRDzP7SZB16brVP4kRYPZsp26VxyL5uC/a8Pfic3thPobTUHw== MIME-Version: 1.0 X-Received: by 10.55.117.10 with SMTP id q10mr21910959qkc.103.1463226505118; Sat, 14 May 2016 04:48:25 -0700 (PDT) Received: by 10.55.70.69 with HTTP; Sat, 14 May 2016 04:48:25 -0700 (PDT) In-Reply-To: References: Date: Sat, 14 May 2016 12:48:25 +0100 Message-ID: Subject: Re: clustered bucket and tablesample From: Mich Talebzadeh To: user Content-Type: multipart/alternative; boundary=94eb2c055c36104e560532cbf81b archived-at: Sat, 14 May 2016 11:48:38 -0000 --94eb2c055c36104e560532cbf81b Content-Type: text/plain; charset=UTF-8 Is action_id can be created as a numeric column: CREATE TABLE X ( action_id bigint, ..) Bucketing or hash partitioning best works on numeric columns with high cardinality (say a primary key). From my old notes: Bucketing in Hive refers to hash partitioning where a hashing function is applied. Likewise an RDBMS like Oracle, Hive will apply a linear hashing algorithm to prevent data from clustering within specific partitions. Hashing is very effective if the column selected for bucketing has very high selectivity like an ID column where selectivity (select count(distinct(column))/count(column) ) = 1. In this case, the created partitions/ files will be as evenly sized as possible. In a nutshell bucketing is a method to get data evenly distributed over many partitions/files. One should define the number of buckets by a power of two -- 2^n, like 2, 4, 8, 16 etc to achieve best results. Again bucketing will help concurrency in Hive. It may even allow a partition wise join i.e. a join between two tables that are bucketed on the same column with the same number of buckets (anyone has tried this?) One more things. When one defines the number of buckets at table creation level in Hive, the number of partitions/files will be fixed. In contrast, with partitioning you do not have this limitation. can you do show create table X and send the output. please. Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com On 14 May 2016 at 12:23, no jihun wrote: > Hello. > > I want to ask the correct bucketing and tablesample way. > > There is a table X which I created by > > CREATE TABLE `X`(`action_id` string,`classifier` string) > CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS > STORED AS ORC > > Then I inserted 500M of rows into X by > > set hive.enforce.bucketing=true; > INSERT OVERWRITE INTO X SELECT * FROM X_RAW > > Then I want to count or search some rows with condition. roughly, > > SELECT COUNT(*) FROM X WHERE action_id='aaa' AND classifier='bbb' > > But I'd better to USE tablesample as I clustered X (action_id, > classifier). So, the better query will be > > SELECT COUNT(*) FROM X > TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier) > WHERE action_id='aaa' AND classifier='bbb' > > Is there any wrong above? But I can't not find any performance gain > between these two query. > > query1 and RESULT( with no tablesample.) > > SELECT COUNT(*)) from X > WHERE action_id='aaa' and classifier='bbb' > > > -------------------------------------------------------------------------------- > VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED > KILLED > > -------------------------------------------------------------------------------- > Map 1 .......... SUCCEEDED 256 256 0 0 > 0 0 > Reducer 2 ...... SUCCEEDED 1 1 0 0 > 0 0 > > -------------------------------------------------------------------------------- > VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 15.35 > s > > -------------------------------------------------------------------------------- > It scans full data. > > query 2 and RESULT > > SELECT COUNT(*)) from X > TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier) > WHERE action_id='aaa' and classifier='bbb' > > > -------------------------------------------------------------------------------- > VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED > KILLED > > -------------------------------------------------------------------------------- > Map 1 .......... SUCCEEDED 256 256 0 0 > 0 0 > Reducer 2 ...... SUCCEEDED 1 1 0 0 > 0 0 > > -------------------------------------------------------------------------------- > VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: > 15.82 s > > -------------------------------------------------------------------------------- > It ALSO scans full data. > > query 2 RESULT WHAT I EXPECTED. > > Result what I expected is something like... > (use 1 map and relatively faster than without tabmesample) > > -------------------------------------------------------------------------------- > VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED > KILLED > > -------------------------------------------------------------------------------- > Map 1 .......... SUCCEEDED 1 1 0 0 > 0 0 > Reducer 2 ...... SUCCEEDED 1 1 0 0 > 0 0 > > -------------------------------------------------------------------------------- > VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: > 3.xx s > > -------------------------------------------------------------------------------- > > Values of action_id and classifier are well distributed and there is no > skewed data. > > So I want to ask you what will be a correct query that prune and target > specific bucket by multiple column? > --94eb2c055c36104e560532cbf81b Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
Is action_id can be created as a numeric column:

CREATE TABLE X (=C2=A0action_id bigint,=C2=A0 ..)

Bucketing or hash partitioning best works on numeric c= olumns with high cardinality (say a primary key).

= From my old notes:

= Bucketing in Hive refers to hash partitioning where a hashing function is applied. Likewise an RDBMS like Oracle, Hive will apply= a linear hashing algorithm to prevent data from clustering within specific partitions. Hashing is very effective if the column selected for bucketing = has very high selectivity like an ID column where selectivity (select count(distinct(column))/count(column) ) =3D 1.=C2=A0 In this case, the crea= ted partitions/ files will be as evenly sized as possible. In a nutshell bucket= ing is a method to get data evenly distributed over many partitions/files.=C2= =A0 One should define the number of buckets by a power of two -- 2^n,=C2=A0 lik= e 2, 4, 8, 16 etc to achieve best results. Again bucketing will help concurrency= in Hive. It may even allow a partition wise join i.e. a join between two table= s that are bucketed on the same column with the same number of buckets (anyon= e has tried this?)

= =C2=A0

= One more things. When one defines the number of buckets at table creation level in Hive, the number of partitions/files wil= l be fixed. In contrast, with partitioning you do not have this limitation.


can you do

show = create table X

and send the output. please.
=



Thanks



On 14 May 2016 at 12:23, no jihun <jeesim2= @gmail.com> wrote:

Hello.

I want to ask the correct bucketing and tablesample way.

There is a table X which I created by

CREATE TABLE `X`(`action_id` string,`classifier` string)
CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
STORED AS ORC

Then I inserted 500M of rows into X by

set hive.enforce.bucketing=3Dtrue;
INSERT OVERWRITE INTO X SELECT * FROM X_RAW

Then I want to count or search some rows with condition. rou= ghly,

SELECT COUNT(*) FROM X WHERE action_id=3D'aaa' AND c= lassifier=3D'bbb'

But I'd better to USE tablesample as I clustered X (acti= on_id, classifier). So, the better query will be

SELECT COUNT(*) FROM X
TABLESAMPLE(BUCKET 1 OUT OF 256 ON=C2=A0 action_id, classifier)
WHERE action_id=3D'aaa' AND classifier=3D'bbb'

Is there any wrong above? But I can't not find any perfo= rmance gain between these two query.

query1 and RESULT( with no tablesample.)

SELECT COUNT(*)) from X
WHERE action_id=3D'aaa' and classifier=3D'bbb'

------------------------------------------------------------= --------------------
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 VERTICES=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0 STATUS=C2=A0 TOTAL=C2=A0 COMPLETED=C2=A0 RUNNING=C2=A0 PENDING=C2=A0= FAILED=C2=A0 KILLED
---------------------------------------------------------------------------= -----
Map 1 ..........=C2=A0=C2=A0 SUCCEEDED=C2=A0=C2=A0=C2=A0 256=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0 256=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0 0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0
Reducer 2 ......=C2=A0=C2=A0 SUCCEEDED=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 1=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 1=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0
---------------------------------------------------------------------------= -----
VERTICES: 02/02=C2=A0 [=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D>>] 100%=C2=A0 ELAPSED TIME: 15.35 s=C2=A0= =C2=A0=C2=A0
---------------------------------------------------------------------------= -----
It scans full data.

query 2 and RESULT

SELECT COUNT(*)) from X
TABLESAMPLE(BUCKET 1 OUT OF 256 ON=C2=A0 action_id, classifier)
WHERE action_id=3D'aaa' and classifier=3D'bbb'

------------------------------------------------------------= --------------------
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 VERTICES=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0 STATUS=C2=A0 TOTAL=C2=A0 COMPLETED=C2=A0 RUNNING=C2=A0 PENDING=C2=A0= FAILED=C2=A0 KILLED
---------------------------------------------------------------------------= -----
Map 1 ..........=C2=A0=C2=A0 SUCCEEDED=C2=A0=C2=A0=C2=A0 256=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0 256=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0 0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0
Reducer 2 ......=C2=A0=C2=A0 SUCCEEDED=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 1=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 1=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0
---------------------------------------------------------------------------= -----
VERTICES: 02/02=C2=A0 [=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D>>] 100%=C2=A0 ELAPSED TIME: 15.82=C2=A0= =C2=A0=C2=A0=C2=A0 s=C2=A0=C2=A0=C2=A0
---------------------------------------------------------------------------= -----
It ALSO scans full data.

query 2 RESULT WHAT I EXPECTED.

Result what I expected is something like...
(use 1 map and relatively faster than without tabmesample)
---------------------------------------------------------------------------= -----
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 VERTICES=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0 STATUS=C2=A0 TOTAL=C2=A0 COMPLETED=C2=A0 RUNNING=C2=A0 PENDING=C2=A0= FAILED=C2=A0 KILLED
---------------------------------------------------------------------------= -----
Map 1 ..........=C2=A0=C2=A0 SUCCEEDED=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 1=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 1=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0
Reducer 2 ......=C2=A0=C2=A0 SUCCEEDED=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 1=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 1=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 0
---------------------------------------------------------------------------= -----
VERTICES: 02/02=C2=A0 [=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D>>] 100%=C2=A0 ELAPSED TIME: 3.xx=C2=A0=C2= =A0=C2=A0=C2=A0 s=C2=A0=C2=A0=C2=A0
---------------------------------------------------------------------------= -----

Values of action_id and classifier are well distributed and = there is no skewed data.

So I want to ask you what will be a correct query that prune= and target specific bucket by multiple column?


--94eb2c055c36104e560532cbf81b--