hive-user mailing list archives

From "Mich Talebzadeh"
Subject Bucketing in Hive
Date Tue, 26 Jan 2016 21:44:00 GMT


A number of questions have been raised about Hive bucketing. As I see it, bucketing is another
name for hash partitioning (assuming that Hive partitioning is effectively range partitioning).
I borrow these terms (range and hash partitioning) from the RDBMS world, where they are in
common use.


Excuse my ignorance, but I am at a loss to know why hash partitioning is called bucketing in Hive.
Perhaps someone can shed light on the main differences, if any.


As I see it, partitioning in an RDBMS has these benefits:


1.    Availability -- each partition can reside on a different segment/device. Hence a problem
with a device will take out a slice of the table's data instead of the whole thing.

2.    Manageability -- partitioning provides a mechanism for splitting whole-table jobs into
clear batches. Partition exchange can make it easier to bulk load data. It also helps with
removing fragmentation, moving older partitions to lower-tier storage, updating statistics, etc.

3.    Performance -- partition elimination, i.e. a query reads only the partitions that can
contain the rows it asks for.
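
To make partition elimination concrete, here is a minimal Python sketch. It is only an
illustration of the idea, not how any RDBMS or Hive implements it: the table, the "month"
partitioning column and the helper names are all made up for the example.

```python
from collections import defaultdict


def build_partitions(rows, key):
    """Group rows into one partition per distinct value of the partitioning column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return partitions


def query_with_elimination(partitions, wanted):
    """Return the matching rows and the number of rows that had to be scanned.

    Only the single partition that can hold the wanted value is read;
    every other partition is eliminated without being touched.
    """
    candidate = partitions.get(wanted, [])
    return candidate, len(candidate)


# Hypothetical table: 1200 rows spread evenly over 12 monthly partitions.
rows = [{"id": i, "month": f"2016-{(i % 12) + 1:02d}"} for i in range(1200)]
partitions = build_partitions(rows, "month")

result, scanned = query_with_elimination(partitions, "2016-01")
print(f"rows returned: {len(result)}, rows scanned: {scanned} of {len(rows)}")
```

A predicate on the partitioning column scans one twelfth of the rows here. A predicate on
any other column would still have to read every partition.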


Hash partitioning is where a hashing function is applied to the partitioning column. An RDBMS
will apply a hashing function f(x), such as mod(x, n) for n partitions, to prevent data from
clustering within specific partitions. Hashing is very effective if the column selected for
partitioning has very high selectivity, like an ID column, where selectivity
(select count(distinct(column))/count(column)) = 1. In this case, the created partitions will
be as evenly sized as possible. In a nutshell, hash partitioning is a method to get data evenly
distributed over many files. One should define the number of hash partitions as a power of
two -- 2^n, like 2, 4, 8, 16 etc. -- to achieve best results. I am pretty sure this definition
applies to Hive bucketing, although Hive's hashing is far simpler.
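
The effect of selectivity on bucket sizes can be sketched in a few lines of Python. This
assumes a plain mod(x, n) hash on integer keys, as in the description above; it is not meant
to reproduce the exact hash function of Hive or of any particular RDBMS, and the column names
are hypothetical.

```python
from collections import Counter


def bucket_for(key: int, num_buckets: int) -> int:
    """Assumed hash function: plain modulo on an integer key."""
    return key % num_buckets


def selectivity(values) -> float:
    """count(distinct(column)) / count(column)."""
    return len(set(values)) / len(values)


def bucket_sizes(values, num_buckets: int):
    """Number of rows that land in each bucket, in bucket order."""
    counts = Counter(bucket_for(v, num_buckets) for v in values)
    return [counts.get(b, 0) for b in range(num_buckets)]


ids = list(range(1, 1001))       # unique ID column, selectivity = 1
status = [v % 3 for v in ids]    # low-selectivity column, 3 distinct values

print("id selectivity:    ", selectivity(ids))
print("status selectivity:", selectivity(status))
print("buckets on id:     ", bucket_sizes(ids, 8))
print("buckets on status: ", bucket_sizes(status, 8))
```

With 8 buckets the unique ID column fills every bucket equally, while the three-valued
column fills only three buckets and leaves the other five empty.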


As for performance, physical co-location of records can speed up some queries, namely those
which search for records by a defined range of keys. However, any query which does not match
the grain of the partitioning will not perform faster (and may even perform slower) than it
would on a non-hash-partitioned (read: non-bucketed) table.


IMO, hash partitioning is unlikely to provide performance benefits, precisely because it shuffles
the keys across the whole table. It will, however, provide the availability and manageability
benefits of partitioning. Unlike standard range partitioning, the number of buckets is fixed, so
it does not fluctuate with the data. It may even allow a partition-wise join, i.e. a join between
two tables that are hash partitioned (bucketed) on the same column with the same number of
partitions (buckets), thus helping certain queries.
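
A rough Python sketch of such a partition-wise (bucket-wise) join follows. It assumes both
tables are bucketed with the same mod hash on the same integer column into the same number of
buckets; the tables and function names are invented for illustration and do not describe how
Hive executes the join internally.

```python
def bucketize(rows, key, num_buckets):
    """Distribute rows into buckets using an assumed mod hash on an integer key."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[row[key] % num_buckets].append(row)
    return buckets


def bucket_wise_join(left_buckets, right_buckets, key):
    """Join bucket i of the left table only with bucket i of the right table.

    Equal keys always hash to the same bucket number on both sides,
    so no other bucket pair can produce a match.
    """
    if len(left_buckets) != len(right_buckets):
        raise ValueError("both tables must have the same number of buckets")
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        lookup = {}
        for r in right:
            lookup.setdefault(r[key], []).append(r)
        for l in left:
            for r in lookup.get(l[key], []):
                joined.append({**l, **r})
    return joined


# Hypothetical tables, both bucketed on "id" into 4 buckets.
customers = [{"id": i, "name": f"cust{i}"} for i in range(1, 9)]
orders = [{"id": i, "amount": 10 * i} for i in (2, 2, 5, 8, 11)]

joined = bucket_wise_join(
    bucketize(customers, "id", 4), bucketize(orders, "id", 4), "id"
)
print(sorted(row["id"] for row in joined))
```

Each bucket pair is joined independently, so the work splits into as many small joins as
there are buckets. If the bucket counts differed, matching keys could sit in different bucket
numbers and this shortcut would no longer be valid, which is why the sketch rejects that case.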





Dr Mich Talebzadeh




Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly



