hive-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: clustered bucket and tablesample
Date Sat, 14 May 2016 17:56:24 GMT
This is your code:

SELECT COUNT(*) FROM X
TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
WHERE action_id='aaa' AND classifier='bbb'

Well, I have a table dummy with 100 million rows imported from Oracle in ORC
format:

hive> show create table dummy;
OK
CREATE TABLE `dummy`(
  `id` int,
  `clustered` int,
  `scattered` int,
  `randomised` int,
  `random_string` varchar(50),
  `small_vc` varchar(10),
  `padding` varchar(10))
CLUSTERED BY (
  id)
INTO 256 BUCKETS
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/dummy'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
  'numFiles'='1',
  'numRows'='100000000',
  'orc.bloom.filter.columns'='ID',
  'orc.bloom.filter.fpp'='0.05',
  'orc.compress'='SNAPPY',
  'orc.create.index'='true',
  'orc.row.index.stride'='10000',
  'orc.stripe.size'='16777216',
  'rawDataSize'='0',
  'totalSize'='5662644579',
  'transient_lastDdlTime'='1463245925')


Now let me show the plans for the following cases. First a simple case:

hive> EXPLAIN SELECT COUNT(1) FROM dummy
    > where id = 20;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (GROUP, 1)
      DagName: hduser_20160514181029_3deafc4c-30da-4113-a425-d8db8f63daeb:7
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: dummy
                  Statistics: Num rows: 100000000 Data size: 5662644736
Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: (id = 20) (type: boolean)
                    Statistics: Num rows: 50000000 Data size: 2831322368 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      Statistics: Num rows: 50000000 Data size: 2831322368
Basic stats: COMPLETE Column stats: NONE
                      Group By Operator
                        aggregations: count(1)
                        mode: hash
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
                        Reduce Output Operator
                          sort order:
                          Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
                          value expressions: _col0 (type: bigint)
        Reducer 2
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
Time taken: 0.064 seconds, Fetched: 51 row(s)


Now we try with TABLESAMPLE, but assigning the predicate values inside the
brackets as below, because you are looking in bucket 1 for those values and
you want the optimizer to know that.



hive> EXPLAIN SELECT COUNT(1) FROM dummy
    > TABLESAMPLE (BUCKET 1 OUT OF 256 ON ID = 10)
    > ;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (GROUP, 1)
      DagName: hduser_20160514181029_3deafc4c-30da-4113-a425-d8db8f63daeb:10
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: dummy
                  Statistics: Num rows: 100000000 Data size: 5662644736
Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: (((hash((id = 10)) & 2147483647) % 256) = 0) (type: boolean)
                    Statistics: Num rows: 50000000 Data size: 2831322368 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      Statistics: Num rows: 50000000 Data size: 2831322368
Basic stats: COMPLETE Column stats: NONE
                      Group By Operator
                        aggregations: count(1)
                        mode: hash
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
                        Reduce Output Operator
                          sort order:
                          Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
                          value expressions: _col0 (type: bigint)
        Reducer 2
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
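As a side note on that filter: the predicate shows Hive's bucket arithmetic,
(hash(expression) & 2147483647) % 256, where 2147483647 is Java's
Integer.MAX_VALUE. A minimal sketch in Python (my illustration only;
bucket_index is a hypothetical helper, and I am assuming an int column
hashes to its own value):

```python
INT_MAX = 2147483647  # Java Integer.MAX_VALUE, the constant in the plan above

def bucket_index(col_hash, num_buckets):
    # Hive-style bucket assignment: (hash & Integer.MAX_VALUE) % numBuckets.
    # The & masks off the sign bit so a negative hash still maps to a
    # non-negative bucket index.
    return (col_hash & INT_MAX) % num_buckets

# Assuming an int column hashes to its own value, id = 10 would land in
# 0-based bucket index 10, i.e. BUCKET 11 in TABLESAMPLE's 1-based numbering.
print(bucket_index(10, 256))   # 10
print(bucket_index(266, 256))  # 10 -- 266 collides with 10 modulo 256
print(bucket_index(-1, 256))   # 255 -- sign bit masked off first
```

Note also that in the plan the hashed expression is the whole boolean
(id = 10), not the bare column.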


Otherwise I don't see much happening:

hive>  EXPLAIN SELECT COUNT(1) FROM dummy
    > TABLESAMPLE (BUCKET 1 OUT OF 256 ON ID)
    > WHERE ID = 10;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (GROUP, 1)
      DagName: hduser_20160514181029_3deafc4c-30da-4113-a425-d8db8f63daeb:13
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: dummy
                  Statistics: Num rows: 100000000 Data size: 5662644736
Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: (false and (id = 10)) (type: boolean)
                    Statistics: Num rows: 25000000 Data size: 1415661184
Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      Statistics: Num rows: 25000000 Data size: 1415661184
Basic stats: COMPLETE Column stats: NONE
                      Group By Operator
                        aggregations: count(1)
                        mode: hash
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
                        Reduce Output Operator
                          sort order:
                          Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
                          value expressions: _col0 (type: bigint)
        Reducer 2
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
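One way to read that constant false (my interpretation; the plan does not
state this explicitly): with TABLESAMPLE ... ON ID, the sampling condition
for bucket 1 is ((hash(id) & 2147483647) % 256) = 0, and once WHERE ID = 10
is pushed in, the optimizer can fold hash(10) % 256 = 10, which is not 0.
A quick check in Python, assuming an int hashes to itself:

```python
INT_MAX = 2147483647  # Java Integer.MAX_VALUE

def in_bucket_1(id_hash):
    # Sampling condition for BUCKET 1 OUT OF 256 (0-based bucket index 0)
    return (id_hash & INT_MAX) % 256 == 0

# id = 10 does not fall in bucket 1, so the sampling condition
# folds to a constant false for this query.
print(in_bucket_1(10))   # False
print(in_bucket_1(0))    # True
print(in_bucket_1(256))  # True -- any id that is 0 modulo 256
```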

In general, in my experience, bucketing is the only area where ORC tables
come in handy.

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 14 May 2016 at 13:38, no jihun <jeesim2@gmail.com> wrote:

> ah, as I mentioned, the field type of both action_id and classifier is
> STRING, and I cannot change the type.
>
> CREATE TABLE `X`(`action_id` string,`classifier` string)
> CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
> STORED AS ORC
>
> I use two fields for the hash and then bucket, because neither field on
> its own is well distributed.
>
> My concern is not about a strong hash source, but about how I can
> TABLESAMPLE down to a bucket by the field values provided in the WHERE
> clause.
>
> When I cluster by string fields, which one is right for TABLESAMPLE?
> 1. provide fields
> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
>
> 2. provide values
> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  'aaa', 'bbb')
> On 14 May 2016 at 20:48, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>
> Can action_id be created as a numeric column?
>>
>> CREATE TABLE X ( action_id bigint,  ..)
>>
>> Bucketing or hash partitioning best works on numeric columns with high
>> cardinality (say a primary key).
>>
>> From my old notes:
>>
>> Bucketing in Hive refers to hash partitioning, where a hashing function is
>> applied. Like an RDBMS such as Oracle, Hive applies a linear hashing
>> algorithm to prevent data from clustering within specific partitions.
>> Hashing is very effective if the column selected for bucketing has very
>> high selectivity, like an ID column where selectivity (select
>> count(distinct(column))/count(column)) = 1. In this case, the created
>> partitions/files will be as evenly sized as possible. In a nutshell,
>> bucketing is a method to get data evenly distributed over many
>> partitions/files. One should define the number of buckets as a power of
>> two, 2^n, like 2, 4, 8, 16 etc., to achieve best results. Bucketing will
>> also help concurrency in Hive. It may even allow a partition-wise join,
>> i.e. a join between two tables that are bucketed on the same column with
>> the same number of buckets (has anyone tried this?).
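The selectivity formula above can be sketched quickly (plain Python, my
illustration only; the sample columns are made up):

```python
def selectivity(values):
    # select count(distinct(column)) / count(column)
    return len(set(values)) / len(values)

ids = list(range(1000))   # an ID-like column: every value distinct
flags = [0, 1] * 500      # a low-cardinality column: only two values

print(selectivity(ids))    # 1.0   -> good bucketing candidate
print(selectivity(flags))  # 0.002 -> poor bucketing candidate
```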
>>
>>
>>
>> One more thing: when one defines the number of buckets at table creation
>> time in Hive, the number of partitions/files is fixed. In contrast, with
>> partitioning you do not have this limitation.
>>
>> Can you do
>>
>> show create table X
>>
>> and send the output, please?
>>
>>
>>
>> Thanks
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 14 May 2016 at 12:23, no jihun <jeesim2@gmail.com> wrote:
>>
>>> Hello.
>>>
>>> I want to ask the correct bucketing and tablesample way.
>>>
>>> There is a table X which I created by
>>>
>>> CREATE TABLE `X`(`action_id` string,`classifier` string)
>>> CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
>>> STORED AS ORC
>>>
>>> Then I inserted 500M of rows into X by
>>>
>>> set hive.enforce.bucketing=true;
>>> INSERT OVERWRITE TABLE X SELECT * FROM X_RAW;
>>>
>>> Then I want to count or search some rows with condition. roughly,
>>>
>>> SELECT COUNT(*) FROM X WHERE action_id='aaa' AND classifier='bbb'
>>>
>>> But I'd better use TABLESAMPLE, as I clustered X by (action_id,
>>> classifier). So the better query should be:
>>>
>>> SELECT COUNT(*) FROM X
>>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
>>> WHERE action_id='aaa' AND classifier='bbb'
>>>
>>> Is anything wrong above? I cannot find any performance gain between
>>> these two queries.
>>>
>>> Query 1 and RESULT (with no TABLESAMPLE):
>>>
>>> SELECT COUNT(*) FROM X
>>> WHERE action_id='aaa' and classifier='bbb'
>>>
>>> ------------------------------------------------------------
>>> --------------------
>>>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING
>>> FAILED  KILLED
>>> ------------------------------------------------------------
>>> --------------------
>>> Map 1 ..........   SUCCEEDED    256        256        0        0
>>> 0       0
>>> Reducer 2 ......   SUCCEEDED      1          1        0        0
>>> 0       0
>>> ------------------------------------------------------------
>>> --------------------
>>> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME:
>>> 15.35 s
>>> ------------------------------------------------------------
>>> --------------------
>>> It scans the full data.
>>>
>>> Query 2 and RESULT:
>>>
>>> SELECT COUNT(*) FROM X
>>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
>>> WHERE action_id='aaa' and classifier='bbb'
>>>
>>> ------------------------------------------------------------
>>> --------------------
>>>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING
>>> FAILED  KILLED
>>> ------------------------------------------------------------
>>> --------------------
>>> Map 1 ..........   SUCCEEDED    256        256        0        0
>>> 0       0
>>> Reducer 2 ......   SUCCEEDED      1          1        0        0
>>> 0       0
>>> ------------------------------------------------------------
>>> --------------------
>>> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME:
>>> 15.82     s
>>> ------------------------------------------------------------
>>> --------------------
>>> It ALSO scans the full data.
>>>
>>> Query 2 RESULT THAT I EXPECTED:
>>>
>>> The result I expected is something like the following
>>> (one map task, and relatively faster than without TABLESAMPLE):
>>> ------------------------------------------------------------
>>> --------------------
>>>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING
>>> FAILED  KILLED
>>> ------------------------------------------------------------
>>> --------------------
>>> Map 1 ..........   SUCCEEDED      1          1        0        0
>>> 0       0
>>> Reducer 2 ......   SUCCEEDED      1          1        0        0
>>> 0       0
>>> ------------------------------------------------------------
>>> --------------------
>>> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME:
>>> 3.xx     s
>>> ------------------------------------------------------------
>>> --------------------
>>>
>>> Values of action_id and classifier are well distributed and there is no
>>> skewed data.
>>>
>>> So I want to ask: what would be a correct query that prunes and targets
>>> a specific bucket by multiple columns?
>>>
>>
>>
