hive-user mailing list archives

From: Lefty Leverenz
Subject: Re: Skewed vs ListBucketing
Date: Wed, 02 Jul 2014 08:38:14 GMT

Well, it turns out I dropped the ball on improving the list bucketing docs
back in April.  See the message thread "Skewed Tables"
that Mayur Gupta started on April 21 and Prasanth Jayachandran left in my
hands on April 28:

> There are two different optimizations that use the "SKEWED BY" keyword. One is
> skewed join optimization and the other is list bucketing optimization. I think
> we need to mention this in some place so that users are aware of the
> difference between the two. "STORED AS DIRECTORIES" is used by only one
> optimization, i.e., list bucketing.
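
To make the distinction concrete, here is a minimal HiveQL sketch based on the
DDL syntax for skewed tables. The table names, column names, and skewed values
are made up for illustration; only the SKEWED BY ... ON ... and STORED AS
DIRECTORIES clauses are taken from the feature itself.

```sql
-- Skewed table: Hive records that these values of `key` are skewed.
-- (Hypothetical table and values, for illustration only.)
CREATE TABLE skewed_example (key STRING, value STRING)
  SKEWED BY (key) ON ('1', '5', '6');

-- List bucketing table: the same skew information, plus the optional
-- STORED AS DIRECTORIES clause, which asks Hive to create a
-- sub-directory for each skewed value.
CREATE TABLE list_bucket_example (key STRING, value STRING)
  SKEWED BY (key) ON ('1', '5', '6')
  STORED AS DIRECTORIES;
```

Read this way, the only difference in the DDL is the trailing STORED AS
DIRECTORIES clause, which matches the quoted statement that only list
bucketing uses it.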

I'm open to suggestions about how to improve the doc, or I'll tackle it as
best I can with the given information.

-- Lefty

On Wed, Jul 2, 2014 at 1:55 AM, Lefty Leverenz wrote:

> The Skewed Tables section in the DDL wikidoc has more information which
> might be helpful.
> HIVE-3649 was just one of several jiras that added list bucketing in
> releases 0.10 and 0.11.  See HIVE-3026 for links to the rest of them.
> (The one that added DML support hasn't been documented yet: HIVE-3073.)
> I'm revising the jira links in the wiki now.
> -- Lefty
> On Wed, Jul 2, 2014 at 1:25 AM, Lefty Leverenz wrote:
>> Does anyone have time to answer this?  It would be good to clarify things
>> in the wiki.
>> HIVE-3649 added the list bucketing feature in release 0.10.0.  The
>> description says:
>>> We need to differ normal skewed table from list bucketing table. we use
>>> an optional parameter "store as DIRECTORIES"
>> So I think your understanding is correct, but let's hear from the experts.
>> -- Lefty
>> On Fri, Jun 27, 2014 at 1:25 PM, Steven Willis wrote:
>>> I'm having trouble understanding the difference between a skewed table
>>> and a list bucketed table:
>>> Is the only difference that ListBucketing stores the data as directories
>>> and a "plain" skewed table stores them as files? I think that's what the
>>> wiki page is saying, but it's very confusing. For one, the title of the
>>> page is ListBucketing and in many places it seems to use the phrase "List
>>> Bucketing" as the general feature of partitioning a table by skewed columns
>>> (whether in directories or files).
>>> There's a section "Skewed Table vs. List Bucketing Table" that
>>> I would assume would spell out the differences between the two, but it says:
>>>  - Skewed Table is a table which has skewed information.
>>>  - List Bucketing Table is a skewed table. In addition, it tells Hive to
>>> use the list bucketing feature on the skewed table: create sub-directories
>>> for skewed values.
>>> That makes it seem like "the list bucketing feature" is just using
>>> sub-directories for the data. If that's the case, why is the whole article
>>> titled ListBucketing, and why is the section describing the basic idea
>>> (that apparently both skewed tables and list bucketed tables have in
>>> common) titled just "List Bucketing"?
>>> The article also says, "Mainly due to its sub-directory nature, list
>>> bucketing can't coexist with some features." So does that mean just list
>>> bucketing (the subdirectory feature that skewed tables can have as an
>>> option) is incompatible with the features mentioned, or does it mean that
>>> any skewed table is incompatible with said features?
>>> -Steve
