hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mich Talebzadeh" <>
Subject RE: Hive Bucketing
Date Wed, 27 Jan 2016 20:34:15 GMT
My conclusion is that from performance point of view as of Hive 1.2.1
bucketing has limited scope if any. I cannot comment on Hive 2 as I have not
got it.


Dr Mich Talebzadeh


Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
Author of the books "A Practitioner's Guide to Upgrading to Sybase ASE 15",
ISBN 978-0-9563693-0-7. 
co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN:
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume
one out shortly

NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is
the responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Technology Ltd, its subsidiaries nor their
employees accept any responsibility.

-----Original Message-----
From: Gopal Vijayaraghavan [] On Behalf Of Gopal
Sent: 26 January 2016 21:36
Subject: Re: Hive Bucketing

> Ok so what is the resolution here? My understanding is that bucketing 
>does not improve the performance. Is that correct?

There are no right answers here - I spend a lot of time fixing over-zealous
optimization attempts



If you use bucketing to speed up a query without understanding the interplay
with other parameters involved (total # of partitions, size of each
partition, bucket col type, skew towards buckets), then it generally ends up
in disaster. And depending on whether you pay for HDP support or not, I
might turn out to be your tow-truck driver.

Bucketing in hive-1.0 is not a general performance feature - it was intended
as a scalability feature for JOINs.

Bucketed map-joins can be really slow if you're limited to MapReduce

Even excluding the execution on cluster, both Spark & Mapreduce serializes
the lookup tables before the query kicks off.

That means there is a pause before the big table can be read (or even tasks
scheduled) - Tez pipelines the scheduling with the generation, so it's not
so bad (through different EdgeManagers).

But a bucket map-join is still slower than a regular map-join within a
single task, because the Tez can cache the hashtable for the regular join as
it is the same one for any split it encounters in the vertex.

*SO*, if your mapjoins are OOM'ing you might want to consider bucketing -
otherwise they're wasted CPUs for JOINs.

With Hive-2.0, if Tez thinks your map-join might OOM, it might bucket data
at runtime & produce a dynamic version of a bucketed map-join

This feature by itself pays for all the complexity Tez has with its runtime
edge reconfiguration.

Back to filters.

You can get speedups in filter queries with bucketing, even in Hive-1.0 (if
the data is sorted & clustered on same col).

Even then, the split-elimination with predicate lookups is unavailable to
any engine using CombineHiveInputFormat (i.e MapReduce & Spark).

So you have to be using ORC+Tez & sorting aligned precisely along your
lookup direction.

If you're on Parquet, it gets a little worse (+~3s or so for each task wave
spin up) and skip reading those rows entirely.


View raw message