hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: clustered bucket and tablesample
Date Sat, 14 May 2016 11:48:25 GMT
Is action_id can be created as a numeric column:

CREATE TABLE X ( action_id bigint,  ..)

Bucketing or hash partitioning best works on numeric columns with high
cardinality (say a primary key).

>From my old notes:

Bucketing in Hive refers to hash partitioning where a hashing function is
applied. Likewise an RDBMS like Oracle, Hive will apply a linear hashing
algorithm to prevent data from clustering within specific partitions.
Hashing is very effective if the column selected for bucketing has very
high selectivity like an ID column where selectivity (select
count(distinct(column))/count(column) ) = 1.  In this case, the created
partitions/ files will be as evenly sized as possible. In a nutshell
bucketing is a method to get data evenly distributed over many
partitions/files.  One should define the number of buckets by a power of
two -- 2^n,  like 2, 4, 8, 16 etc to achieve best results. Again bucketing
will help concurrency in Hive. It may even allow a partition wise join i.e.
a join between two tables that are bucketed on the same column with the
same number of buckets (anyone has tried this?)



One more things. When one defines the number of buckets at table creation
level in Hive, the number of partitions/files will be fixed. In contrast,
with partitioning you do not have this limitation.

can you do

show create table X

and send the output. please.



Thanks


Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 14 May 2016 at 12:23, no jihun <jeesim2@gmail.com> wrote:

> Hello.
>
> I want to ask the correct bucketing and tablesample way.
>
> There is a table X which I created by
>
> CREATE TABLE `X`(`action_id` string,`classifier` string)
> CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
> STORED AS ORC
>
> Then I inserted 500M of rows into X by
>
> set hive.enforce.bucketing=true;
> INSERT OVERWRITE INTO X SELECT * FROM X_RAW
>
> Then I want to count or search some rows with condition. roughly,
>
> SELECT COUNT(*) FROM X WHERE action_id='aaa' AND classifier='bbb'
>
> But I'd better to USE tablesample as I clustered X (action_id,
> classifier). So, the better query will be
>
> SELECT COUNT(*) FROM X
> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
> WHERE action_id='aaa' AND classifier='bbb'
>
> Is there any wrong above? But I can't not find any performance gain
> between these two query.
>
> query1 and RESULT( with no tablesample.)
>
> SELECT COUNT(*)) from X
> WHERE action_id='aaa' and classifier='bbb'
>
>
> --------------------------------------------------------------------------------
>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
> KILLED
>
> --------------------------------------------------------------------------------
> Map 1 ..........   SUCCEEDED    256        256        0        0
> 0       0
> Reducer 2 ......   SUCCEEDED      1          1        0        0
> 0       0
>
> --------------------------------------------------------------------------------
> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 15.35
> s
>
> --------------------------------------------------------------------------------
> It scans full data.
>
> query 2 and RESULT
>
> SELECT COUNT(*)) from X
> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
> WHERE action_id='aaa' and classifier='bbb'
>
>
> --------------------------------------------------------------------------------
>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
> KILLED
>
> --------------------------------------------------------------------------------
> Map 1 ..........   SUCCEEDED    256        256        0        0
> 0       0
> Reducer 2 ......   SUCCEEDED      1          1        0        0
> 0       0
>
> --------------------------------------------------------------------------------
> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME:
> 15.82     s
>
> --------------------------------------------------------------------------------
> It ALSO scans full data.
>
> query 2 RESULT WHAT I EXPECTED.
>
> Result what I expected is something like...
> (use 1 map and relatively faster than without tabmesample)
>
> --------------------------------------------------------------------------------
>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
> KILLED
>
> --------------------------------------------------------------------------------
> Map 1 ..........   SUCCEEDED      1          1        0        0
> 0       0
> Reducer 2 ......   SUCCEEDED      1          1        0        0
> 0       0
>
> --------------------------------------------------------------------------------
> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME:
> 3.xx     s
>
> --------------------------------------------------------------------------------
>
> Values of action_id and classifier are well distributed and there is no
> skewed data.
>
> So I want to ask you what will be a correct query that prune and target
> specific bucket by multiple column?
>

Mime
View raw message