hive-user mailing list archives

From Lefty Leverenz <leftylever...@gmail.com>
Subject Re: Skewed vs ListBucketing
Date Wed, 02 Jul 2014 08:38:14 GMT
Well, it turns out I dropped the ball on improving the list bucketing docs
back in April.  See the message thread "Skewed Tables"
<http://mail-archives.apache.org/mod_mbox/hive-user/201404.mbox/%3c4B94C3FD-B6D2-4844-8CEB-7C992A2261F6@hortonworks.com%3e>
that Mayur Gupta started on April 21 and Prasanth Jayachandran left in my
hands on April 28:

> There are two different optimizations that use “SKEWED BY” keyword. One is
> skewed join optimization and other is list bucketing optimization. I think
> we need to mention this in some place so that users are aware of the
> difference between the two. “STORED AS DIRECTORIES” is used by only one
> optimization i.e list bucketing.
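
To make the distinction in that quote concrete, here is a minimal sketch of the two DDL forms as I understand them from the Skewed Tables section of the DDL doc. The table names, columns, and skewed values are made up for illustration, and this is not taken from the thread itself.

```sql
-- Skewed table: only records which values of "key" are skewed.
CREATE TABLE skewed_example (key STRING, value STRING)
  SKEWED BY (key) ON ('1', '5', '6');

-- List bucketing table: the same skew information, plus STORED AS DIRECTORIES,
-- which asks Hive to create sub-directories for the skewed values.
CREATE TABLE list_bucket_example (key STRING, value STRING)
  SKEWED BY (key) ON ('1', '5', '6')
  STORED AS DIRECTORIES;
```

If that reading is right, the only syntactic difference between the two is the trailing STORED AS DIRECTORIES clause.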


I'm open to suggestions about how to improve the doc, or I'll tackle it as
best I can with the given information.

-- Lefty


On Wed, Jul 2, 2014 at 1:55 AM, Lefty Leverenz <leftyleverenz@gmail.com>
wrote:

> The Skewed Tables
> <https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-SkewedTables>
> section in the DDL wikidoc has more information which might be helpful.
>
> HIVE-3649 was just one of several jiras that added list bucketing in
> releases 0.10 and 0.11.  See HIVE-3026
> <https://issues.apache.org/jira/browse/HIVE-3026> for links to the rest
> of them.  (The one that added DML support hasn't been documented yet:
> HIVE-3073 <https://issues.apache.org/jira/browse/HIVE-3073>.)
>
> I'm revising the jira links in the wiki now.
>
> -- Lefty
>
>
> On Wed, Jul 2, 2014 at 1:25 AM, Lefty Leverenz <leftyleverenz@gmail.com>
> wrote:
>
>> Does anyone have time to answer this?  It would be good to clarify things
>> in the wiki.
>>
>> HIVE-3649 <https://issues.apache.org/jira/browse/HIVE-3649> added the
>> list bucketing feature in release 0.10.0.  The description says:
>>
>>> We need to differ normal skewed table from list bucketing table. we use
>>> an optional parameter "store as DIRECTORIES"
>>
>>
>> So I think your understanding is correct, but let's hear from the experts.
>>
>> -- Lefty
>>
>>
>> On Fri, Jun 27, 2014 at 1:25 PM, Steven Willis <swillis@compete.com>
>> wrote:
>>
>>> I'm having trouble understanding the difference between a skewed table
>>> and a list bucketed table:
>>>
>>> https://cwiki.apache.org/confluence/display/Hive/ListBucketing
>>>
>>> Is the only difference that ListBucketing stores the data as directories
>>> and a "plain" skewed table stores them as files? I think that's what the
>>> wiki page is saying, but it's very confusing. For one, the title of the
>>> page is ListBucketing and in many places it seems to use the phrase "List
>>> Bucketing" as the general feature of partitioning a table by skewed columns
>>> (whether in directories or files).
>>>
>>> There's a section "Skewed Table vs. List Bucketing Table" (
>>> https://cwiki.apache.org/confluence/display/Hive/ListBucketing#ListBucketing-ListBucketing)
>>> that I would assume would spell out the differences between the two, but it says:
>>>
>>>  - Skewed Table is a table which has skewed information.
>>>  - List Bucketing Table is a skewed table. In addition, it tells Hive to
>>> use the list bucketing feature on the skewed table: create sub-directories
>>> for skewed values.
>>>
>>> That makes it seem like "the list bucketing feature" is just using
>>> sub-directories for the data. If that's the case, why is the whole article
>>> titled ListBucketing, and why is the section describing the basic idea
>>> (that apparently both skewed tables and list bucketed tables have in
>>> common) titled just "List Bucketing" (
>>> https://cwiki.apache.org/confluence/display/Hive/ListBucketing#ListBucketing-ListBucketing
>>> ).
>>>
>>> The article also says, "Mainly due to its sub-directory nature, list
>>> bucketing can't coexist with some features." So does that mean just list
>>> bucketing (the subdirectory feature that skewed tables can have as an
>>> option) is incompatible with the features mentioned, or does it mean that
>>> any skewed table is incompatible with said features?
>>>
>>> -Steve
>>>
>>
>>
>
