Return-Path: X-Original-To: apmail-hive-user-archive@www.apache.org Delivered-To: apmail-hive-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 5B15010859 for ; Mon, 28 Apr 2014 21:01:36 +0000 (UTC) Received: (qmail 46192 invoked by uid 500); 28 Apr 2014 21:00:40 -0000 Delivered-To: apmail-hive-user-archive@hive.apache.org Received: (qmail 46006 invoked by uid 500); 28 Apr 2014 21:00:30 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 45650 invoked by uid 99); 28 Apr 2014 21:00:15 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 28 Apr 2014 21:00:15 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of pjayachandran@hortonworks.com designates 209.85.160.46 as permitted sender) Received: from [209.85.160.46] (HELO mail-pb0-f46.google.com) (209.85.160.46) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 28 Apr 2014 21:00:09 +0000 Received: by mail-pb0-f46.google.com with SMTP id ma3so2434822pbc.19 for ; Mon, 28 Apr 2014 13:59:49 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:from:message-id:mime-version:subject:date :references:to:in-reply-to:content-type; bh=JS/4S1yS2QXsmvS6bv8OiNb3eV4mfDbEBAmwWqTQZRU=; b=JJloulOIPDqoeiKwfeVbB8W6nlWXXg15IJ/jam57BHXLxVZgyEOR48Tdkpy1pRsk37 59NfYAcCpzInm8WyCHoD/TzZVc40Qznm+8kZ4BFZYIBznflgLDjQP8oNSFR8JoxLuHyN YBe8pLAb8x7NXTKtqmKC9WQs+NFFy7dOKT7CI3fMKjtWMXZYFFqkXSmnW/FfXCEfdLb5 +s+Ik6j95PmaYbWEhxXnvEZevLiawLtjGRJIDzT1s+xWubMy6tKzEk7+X4PV37K27LFZ JEq4xcj8fCGrIGHvJhmBVKEQjBpAFjE9Aq4LMCUNtKxw/1hR0OxBIiplkz9N/ujE6Mci G71w== X-Gm-Message-State: ALoCoQkrIanrXoMODteoCXYQ17pBlBh4LP52rE+t9OtOEenrxEMKz81qiairdvh5+hp/FcUSTZXW6+BrX4SL6c/dcSHA1Mab98dtA8gNUpMJPju6BoElu3w= X-Received: by 10.67.1.202 with SMTP id bi10mr28006958pad.68.1398718789063; Mon, 28 Apr 2014 13:59:49 -0700 (PDT) Received: from [10.10.8.13] ([192.175.27.2]) by mx.google.com with ESMTPSA id qv9sm36980504pbc.71.2014.04.28.13.59.47 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Mon, 28 Apr 2014 13:59:48 -0700 (PDT) From: Prasanth Jayachandran Message-Id: <4B94C3FD-B6D2-4844-8CEB-7C992A2261F6@hortonworks.com> Mime-Version: 1.0 (Mac OS X Mail 7.2 \(1874\)) Subject: Re: Skewed Tables Date: Mon, 28 Apr 2014 13:59:44 -0700 References: <4E01431A-61D8-4799-AC7F-DCF67C5D9CF9@hortonworks.com> To: user@hive.apache.org In-Reply-To: X-Mailer: Apple Mail (2.1874) Content-Type: multipart/alternative; boundary="Apple-Mail=_660C10C4-0D6C-425E-9AD2-813D059E4677" X-Virus-Checked: Checked by ClamAV on apache.org --Apple-Mail=_660C10C4-0D6C-425E-9AD2-813D059E4677 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=windows-1252 Lefty, I have updated the hive wiki in few places to say we should use "sto= red as directories" for list bucketing features. There are two different op= timizations that uses "SKEWED BY=94 keyword. One is skewed join optimizatio= n and other is list bucketing optimization. I think we need to mention this= in some place so that users are aware of the difference between the two. = =93STORED AS DIRECTORIES=94 is used by only one optimization i.e list bucke= ting. Following are the design docs for both https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization https://cwiki.apache.org/confluence/display/Hive/ListBucketing Thanks Prasanth Jayachandran On Apr 27, 2014, at 11:28 PM, Lefty Leverenz wrot= e: > Prasanth, Hive's user docs are wiki-only at this point so there's no vers= ion control. We just add notes about which release introduced or changed s= omething. For an example see the beginning of the Skewed Tables section. = Sometimes the version information isn't called out like that, though, it's = just part of the text. And in the CREATE TABLE syntax it's a comment along= side a clause such as TBLPROPERTIES. >=20 > The procedure for getting wiki access is described in About This Wiki: > How to get permission to edit > Create a Confluence account > Sign up for the user mailing list by sending a message to user-subscribe@= hive.apache.org > Send a message to user@hive.apache.org requesting write access >=20 > Ashutosh has been granting wiki edit privileges lately (Carl Steinbach us= ed to do it). I don't know how it's done or I'd gladly give you access. >=20 > I hope you'll be able to take care of this doc because you understand ske= wed tables and I only know what I've read in the wiki, so I think you'll do= a better job. But of course I'll review it and tinker with it a bit. >=20 >=20 > -- Lefty >=20 >=20 > On Mon, Apr 28, 2014 at 1:40 AM, Prasanth Jayachandran wrote: > @Mayur.. I don=92t think the initial design considered CTAS for skewed ta= bles. So it might not be supported at all. >=20 > @Lefty.. I am not sure where/how the docs are maintained. Is it version c= ontrolled? Or is it only maintained in confluence wiki? If it is the later = can you please provide me access to edit the wiki? or alternatively if you = can update the docs adding =93stored as directories=94 to the examples, it = will be great. Also updating the docs with =93CTAS not supported for list b= ucketing=94. >=20 > Thanks > Prasanth Jayachandran >=20 > On Apr 26, 2014, at 8:03 AM, Mayur Gupta wrote: >=20 >> Hey Prasanth, >>=20 >> The CTAS for skewed table doesn't work, is it a bug? >>=20 >> create tablet1(r1 string, r2 string) skewed by (r2) on (=91a=92) stored = as directories select r1, r2 from t2; >>=20 >>=20 >> On Thu, Apr 24, 2014 at 3:03 PM, Mayur Gupta w= rote: >> Thanks a lot Prasanth for the reply. I would have never figured that out= as the documentation at Hive Wiki DDL page and design page doesn't list th= is.=20 >>=20 >> One additional point it seems the Skewed table doesn't work when the tab= le is created as CTAS. The below statement doesn't create separate files. I= s it a bug or is it by intent? >>=20 >> create tablet1(r1 string, r2 string) skewed by (r2) on (=91a=92) stored = as directories select r1, r2 from t2; >>=20 >>=20 >> On Thu, Apr 24, 2014 at 6:12 AM, Prasanth Jayachandran wrote: >> Hi Mayur, >>=20 >> The reason why you see single file is, you have not enabled storing skew= ed columns/values as directories. >> You can do the following to enable storing the skewed columns and values= as directories >>=20 >> set hive.mapred.supports.subdirectories=3Dtrue; >> set mapred.input.dir.recursive=3Dtrue; >> create tablet1(r1 string, r2 string) skewed by (r2) on (=91a=92) stored = as directories; >>=20 >> This will enable you to store the skewed columns as directories below >>=20 >> /user/hive/warehouse/t1/r2=3Da/000000_0 (skewed values go here) >> /user/hive/warehouse/t1/HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME/000000_0 (a= ll other values go here) >>=20 >> With respect to your desc extended question where skewedColValueLocation= Maps is empty, its a bug in implementation. I just verified that it shows e= mpty for unpartitioned tables. But it shows correctly for partitioned table= s. >> I have created a bug for unpartitioned tables here which you can track f= or progress on this issue https://issues.apache.org/jira/browse/HIVE-6968 >>=20 >>=20 >> Thanks >> Prasanth Jayachandran >>=20 >> On Apr 23, 2014, at 6:52 AM, Mayur Gupta wrote= : >>=20 >>> Below is my skewedInfo >>>=20 >>> skewedInfo:SkewedInfo(skewedColNames:[r2], skewedColValues:[[a]], skewe= dColValueLocationMaps:{}) >>>=20 >>> Any idea why is the skewedColValueLocationMaps empty?=20 >>>=20 >>>=20 >>> On Mon, Apr 21, 2014 at 11:19 AM, Mayur Gupta = wrote: >>> Hey There, >>>=20 >>> I was trying to use Skewed tables but I am facing the issue that it is = not creating separate files for the skewed data. Even with a simple example= I am having the same issue. The hive version is 0.11. >>>=20 >>> create table t(col1 string, col2 string); >>> load data local inpath '/home/hadoop/a.txt' into table t;=20 >>>=20 >>> create table t1(r1 string, r2 string) skewed by (r2) on ('a'); >>> insert into table t1 select * from t; >>>=20 >>> The contents of a.txt are : >>> 1 ^Aa >>> 2^A b >>> 3 ^Ac >>> 4 ^Aa >>> 5 ^Ab >>> 6 ^Aa >>>=20 >>> I see only single file. >>>=20 >>> /user/hive/warehouse/t1/000000_0 >>>=20 >>> Any pointers on what I am doing wrong? >>>=20 >>=20 >>=20 >> CONFIDENTIALITY NOTICE >> NOTICE: This message is intended for the use of the individual or entity= to which it is addressed and may contain information that is confidential,= privileged and exempt from disclosure under applicable law. If the reader = of this message is not the intended recipient, you are hereby notified that= any printing, copying, dissemination, distribution, disclosure or forwardi= ng of this communication is strictly prohibited. If you have received this = communication in error, please contact the sender immediately and delete it= from your system. Thank You. >>=20 >>=20 >=20 >=20 > CONFIDENTIALITY NOTICE > NOTICE: This message is intended for the use of the individual or entity = to which it is addressed and may contain information that is confidential, = privileged and exempt from disclosure under applicable law. If the reader o= f this message is not the intended recipient, you are hereby notified that = any printing, copying, dissemination, distribution, disclosure or forwardin= g of this communication is strictly prohibited. If you have received this c= ommunication in error, please contact the sender immediately and delete it = from your system. Thank You. >=20 --=20 CONFIDENTIALITY NOTICE NOTICE: This message is intended for the use of the individual or entity to= =20 which it is addressed and may contain information that is confidential,=20 privileged and exempt from disclosure under applicable law. If the reader= =20 of this message is not the intended recipient, you are hereby notified that= =20 any printing, copying, dissemination, distribution, disclosure or=20 forwarding of this communication is strictly prohibited. If you have=20 received this communication in error, please contact the sender immediately= =20 and delete it from your system. Thank You. --Apple-Mail=_660C10C4-0D6C-425E-9AD2-813D059E4677 Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=windows-1252 Lefty, I have updated = the hive wiki in few places to say we should use "stored as directories" fo= r list bucketing features. There are two different optimizations that uses = "SKEWED BY=94 keyword. One is skewed join optimization and other is list bu= cketing optimization. I think we need to mention this in some place so that= users are aware of the difference between the two. =93STORED AS DIRECTORIE= S=94 is used by only one optimization i.e list bucketing.

Following are the design docs for both
ht= tps://cwiki.apache.org/confluence/display/Hive/ListBucketing
=
Thanks
Prasanth Jayachandran

On Apr 27, 2014, at 11:28 PM, Lefty Leverenz <leftyleverenz@gmail.com> wrote:
=
Prasanth, Hive's user docs are wiki-only at this point so there's = no version control.  We just add notes about which release introduced = or changed something.  For an example see the beginning of the Skewed Tables section.  Sometimes the= version information isn't called out like that, though, it's just part of = the text.  And in the CREATE TABLE = syntax it's a comment alongside a clause such as TBLPROPERTIES.

The procedure for getting wiki access is described in About = This Wiki:

How to get permission to edit

<= /span>

Ashutosh has been granting wiki edit privileges lately = (Carl Steinbach used to do it).  I don't know how it's done or I'd gla= dly give you access.

I hope you'll be able to take= care of this doc because you understand skewed tables and I only know what= I've read in the wiki, so I think you'll do a better job.  But of cou= rse I'll review it and tinker with it a bit.


-- Lefty


On Mon, Apr 28, 2014 at 1:40 AM, Prasant= h Jayachandran <pjayachandran@hortonworks.com> w= rote:
@Mayur..= I don=92t think the initial design considered CTAS for skewed tables. So i= t might not be supported at all.

@Lefty.. I am not sure where/how the docs are maintained. Is= it version controlled? Or is it only maintained in confluence wiki? If it = is the later can you please provide me access to edit the wiki? or alternat= ively if you can update the docs adding =93stored as directories=94 to the = examples, it will be great. Also updating the docs with =93CTAS not support= ed for list bucketing=94.

Thanks
Prasan= th Jayachandran

On Apr 26, 2014, at 8:03 AM, Mayur Gupta <mayur.gupta81@gmail.com>= ; wrote:

Hey Prasanth,<= div>
The CTAS for skewed table doesn't work, is it a bug?

create tablet1(r1 string, r2 string) skewed by (r2) on (=91a= =92) stored as directories select r1, r2 from t2;


O= n Thu, Apr 24, 2014 at 3:03 PM, Mayur Gupta <mayur.gupta81@gmail.c= om> wrote:
Thanks a lot Prasanth for the reply. I wo= uld have never figured that out as the documentation at Hive Wiki DDL page and design page doesn't list this. 

One additional point it seems the Skewed table doesn't work = when the table is created as CTAS. The below statement doesn't create&= nbsp;separate files. Is it a bug or is it by intent?

create tablet1(r1 string, r2 string) skewed by (r2) on (=91a=92) stored as= directories select r1, r2 from t2;


On Thu, Apr 24, 2014 at 6:12 AM, Prasanth Jayachandran <pj= ayachandran@hortonworks.com> wrote:
Hi Mayur= ,

The reason why you see single file is, you have not en= abled storing skewed columns/values as directories.
You can do the following to enable storing the skewed columns and valu= es as directories

set hive.mapred.supports.subdirectories=3Dtrue;
set mapred.inp= ut.dir.recursive=3Dtrue;
create tablet1(r1 string, r2 string) skewed = by (r2) on (=91a=92) stored as directories;

This will enable you to store the skewed columns as dir= ectories below

/user/hive/warehouse/t1/r2=3Da/0000= 00_0 (skewed values go here)
/user/hive/warehouse/t1/HIVE_DEFAULT= _LIST_BUCKETING_DIR_NAME/000000_0 (all other values go here)

With respect to your desc extended question where skewe= dColValueLocationMaps is empty, its a bug in implementation. I just verifie= d that it shows empty for unpartitioned tables. But it shows correctly for = partitioned tables.
I have created a bug for unpartitioned tables here which you can track= for progress on this issue https://issues.apache.org/jira/browse/H= IVE-6968


Thanks
Prasanth Jayachandran

On Apr 23, 2014, at 6:52 AM, Mayur Gupta <mayur.gupta81@gmail.com>= ; wrote:

Below is my sk= ewedInfo

skewedInfo:SkewedInfo(skewedColNames:[r2], skewedColValues:[= [a]], skewedColValueLocationMaps:{})

Any idea = why is the skewedColValueLocationMaps empty? 


On Mon,= Apr 21, 2014 at 11:19 AM, Mayur Gupta <mayur.gupta81@gmail.com&= gt; wrote:
Hey There,

I was tryin= g to use Skewed tables but I am facing the issue that it is not creating se= parate files for the skewed data. Even with a simple example I am having th= e same issue. The hive version is 0.11.

create table t(col1 string, col2 string);
load  data local inpath '/home/hadoop/a.txt' into table t; 
<= /div>

create table t1(r1 string, r2 string) skewed by (r= 2) on ('a');
insert into table t1 select * from t;

The contents of a.txt are :
1 ^Aa
2^A b
=
3 ^Ac
4 ^Aa
5 ^Ab
6 ^Aa

I see only single file.

/user/hive/= warehouse/t1/000000_0

Any pointers on what I a= m doing wrong?



CONFIDENTIALITY NOTICE
NOTICE: This message is = intended for the use of the individual or entity to which it is addressed a= nd may contain information that is confidential, privileged and exempt from= disclosure under applicable law. If the reader of this message is not the = intended recipient, you are hereby notified that any printing, copying, dis= semination, distribution, disclosure or forwarding of this communication is= strictly prohibited. If you have received this communication in error, ple= ase contact the sender immediately and delete it from your system. Thank Yo= u.



=

CONFIDENTIALITY NOTICE
NOTICE: This message is = intended for the use of the individual or entity to which it is addressed a= nd may contain information that is confidential, privileged and exempt from= disclosure under applicable law. If the reader of this message is not the = intended recipient, you are hereby notified that any printing, copying, dis= semination, distribution, disclosure or forwarding of this communication is= strictly prohibited. If you have received this communication in error, ple= ase contact the sender immediately and delete it from your system. Thank Yo= u.



CONFIDENTIALITY NOTICE
NOTICE: This message is = intended for the use of the individual or entity to which it is addressed a= nd may contain information that is confidential, privileged and exempt from= disclosure under applicable law. If the reader of this message is not the = intended recipient, you are hereby notified that any printing, copying, dis= semination, distribution, disclosure or forwarding of this communication is= strictly prohibited. If you have received this communication in error, ple= ase contact the sender immediately and delete it from your system. Thank Yo= u. --Apple-Mail=_660C10C4-0D6C-425E-9AD2-813D059E4677--