Return-Path: X-Original-To: apmail-hive-user-archive@www.apache.org Delivered-To: apmail-hive-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 9E79918B0B for ; Mon, 25 Jan 2016 21:07:08 +0000 (UTC) Received: (qmail 59331 invoked by uid 500); 25 Jan 2016 21:07:07 -0000 Delivered-To: apmail-hive-user-archive@hive.apache.org Received: (qmail 59265 invoked by uid 500); 25 Jan 2016 21:07:07 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 59254 invoked by uid 99); 25 Jan 2016 21:07:06 -0000 Received: from Unknown (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 25 Jan 2016 21:07:06 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id 80341C094C for ; Mon, 25 Jan 2016 21:07:06 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 4.101 X-Spam-Level: **** X-Spam-Status: No, score=4.101 tagged_above=-999 required=6.31 tests=[HTML_MESSAGE=3, KAM_COUK=1.1, URIBL_BLOCKED=0.001] autolearn=disabled Received: from mx1-us-east.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP id g_-eS5GTAy0P for ; Mon, 25 Jan 2016 21:06:54 +0000 (UTC) Received: from sulu.netzoomi.net (sulu.netzoomi.net [83.138.144.103]) by mx1-us-east.apache.org (ASF Mail Server at mx1-us-east.apache.org) with ESMTP id 7890D42BC5 for ; Mon, 25 Jan 2016 21:06:53 +0000 (UTC) Received: from vulcan.netzoomi.net (unknown [212.100.249.54]) by sulu.netzoomi.net (Postfix) with ESMTP id A9C996A4A4C for ; Mon, 25 Jan 2016 21:06:46 +0000 (GMT) X-Envelope-From: Received: from w7 (cpc86449-seve24-2-0-cust177.13-3.cable.virginm.net [86.19.59.178]) by vulcan.netzoomi.net (Postfix) with ESMTPA id 713C31690024 for ; Mon, 25 Jan 2016 21:06:46 +0000 (GMT) From: "Mich Talebzadeh" To: References: In-Reply-To: Subject: RE: Hive Bucketing Date: Mon, 25 Jan 2016 21:07:05 -0000 Message-ID: <03e901d157b4$580b2430$08216c90$@peridale.co.uk> MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="----=_NextPart_000_03EA_01D157B4.580CD1E0" X-Mailer: Microsoft Outlook 16.0 Thread-Index: AQJtadYnRrdf5ZXEvGBevs+yICKCt53UlLWg Content-Language: en-gb This is a multipart message in MIME format. ------=_NextPart_000_03EA_01D157B4.580CD1E0 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Hi, =20 Can you do =20 show create table =20 And send the output please. =20 Thanks =20 Dr Mich Talebzadeh =20 LinkedIn = https://www.linkedin.com/profile/view?id=3DAAEAAAAWh2gBxianrbJd6zP6AcPCCd= OABUrV8Pw =20 Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial Data on ASE 15 http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-0919= 08.pdf Author of the books "A Practitioner=E2=80=99s Guide to Upgrading to = Sybase ASE 15", ISBN 978-0-9563693-0-7.=20 co-author "Sybase Transact SQL Guidelines Best Practices", ISBN = 978-0-9759693-0-4 Publications due shortly: Complex Event Processing in Heterogeneous Environments, ISBN: = 978-0-9563693-3-8 Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, = volume one out shortly =20 http://talebzadehmich.wordpress.com = =20 =20 NOTE: The information in this email is proprietary and confidential. = This message is for the designated recipient only, if you are not the = intended recipient, you should destroy it immediately. Any information = in this message shall not be understood as given or endorsed by Peridale = Technology Ltd, its subsidiaries or their employees, unless expressly so = stated. It is the responsibility of the recipient to ensure that this = email is virus free, therefore neither Peridale Technology Ltd, its = subsidiaries nor their employees accept any responsibility. =20 From: =E8=B0=AD=E6=88=90=E7=81=B6 [mailto:tanxinz@live.cn]=20 Sent: 25 January 2016 15:37 To: user@hive.apache.org Subject: =E7=AD=94=E5=A4=8D: Hive Bucketing =20 Hi,how to efficient insert into an orc bucket table,I found it too = slow.thanks you=20 _____ =20 =E5=8F=91=E4=BB=B6=E4=BA=BA: Mich Talebzadeh = =20 =E5=8F=91=E9=80=81=E6=97=B6=E9=97=B4: = =E2=80=8E2016/=E2=80=8E1/=E2=80=8E23 7:31 =E6=94=B6=E4=BB=B6=E4=BA=BA: user@hive.apache.org = =20 =E4=B8=BB=E9=A2=98: RE: Hive Bucketing Hi, =20 In general my understanding is that it will be possible to use bucket = pruning much like partition pruning (elimination) soon =20 Bucketing in Hive refers to hash partitioning where a hashing function = is applied. Likewise an RDBMS like Oracle, Hive will apply a linear = hashing algorithm to prevent data from clustering within specific = partitions. Hashing is very effective if the column selected for = bucketing has very high selectivity like an ID column where selectivity = (select count(distinct(column))/count(column) ) =3D 1. In this case, = the created partitions/ files will be as evenly sized as possible. In a = nutshell bucketing is a method to get data evenly distributed over many = partitions/files. One should define the number of buckets by a power of = two -- 2^n, like 2, 4, 8, 16 etc to achieve best results. Again = bucketing will help concurrency in Hive. It may even allow a partition = wise join i.e. a join between two tables that are bucketed on the same = column with the same number of buckets (anyone has tried this?) =20 One more things. When one defines the number of buckets at table = creation level in Hive, the number of partitions/files will be fixed. In = contrast, with partitioning you do not have this limitation.=20 . Have you considered creating these tables as ORC tables? =20 HTH =20 Dr Mich Talebzadeh =20 LinkedIn = https://www.linkedin.com/profile/view?id=3DAAEAAAAWh2gBxianrbJd6zP6AcPCCd= OABUrV8Pw =20 Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial Data on ASE 15 http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-0919= 08.pdf Author of the books "A Practitioner=E2=80=99s Guide to Upgrading to = Sybase ASE 15", ISBN 978-0-9563693-0-7.=20 co-author "Sybase Transact SQL Guidelines Best Practices", ISBN = 978-0-9759693-0-4 Publications due shortly: Complex Event Processing in Heterogeneous Environments, ISBN: = 978-0-9563693-3-8 Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, = volume one out shortly =20 http://talebzadehmich.wordpress.com = =20 =20 NOTE: The information in this email is proprietary and confidential. = This message is for the designated recipient only, if you are not the = intended recipient, you should destroy it immediately. Any information = in this message shall not be understood as given or endorsed by Peridale = Technology Ltd, its subsidiaries or their employees, unless expressly so = stated. It is the responsibility of the recipient to ensure that this = email is virus free, therefore neither Peridale Technology Ltd, its = subsidiaries nor their employees accept any responsibility. =20 From: Akansha Jain [mailto:akansha.15aug85@gmail.com]=20 Sent: 22 January 2016 23:20 To: user@hive.apache.org =20 Subject: RE: Hive Bucketing =20 Thanks for response. I am using 0.13 mapr version. Could you tell more = about bucket pruning.=20 On Jan 22, 2016 3:09 PM, "Mich Talebzadeh" > wrote: Ok we are talking about bucket pruning here =20 What version of Hive are using? =20 Bucket pruning I believe is available from version 2.0 =20 HTH =20 Dr Mich Talebzadeh =20 LinkedIn = https://www.linkedin.com/profile/view?id=3DAAEAAAAWh2gBxianrbJd6zP6AcPCCd= OABUrV8Pw =20 Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial Data on ASE 15 http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-0919= 08.pdf Author of the books "A Practitioner=E2=80=99s Guide to Upgrading to = Sybase ASE 15", ISBN 978-0-9563693-0-7.=20 co-author "Sybase Transact SQL Guidelines Best Practices", ISBN = 978-0-9759693-0-4 Publications due shortly: Complex Event Processing in Heterogeneous Environments, ISBN: = 978-0-9563693-3-8 Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, = volume one out shortly =20 http://talebzadehmich.wordpress.com = =20 =20 NOTE: The information in this email is proprietary and confidential. = This message is for the designated recipient only, if you are not the = intended recipient, you should destroy it immediately. Any information = in this message shall not be understood as given or endorsed by Peridale = Technology Ltd, its subsidiaries or their employees, unless expressly so = stated. It is the responsibility of the recipient to ensure that this = email is virus free, therefore neither Peridale Technology Ltd, its = subsidiaries nor their employees accept any responsibility. =20 From: Akansha Jain [mailto:akansha.15aug85@gmail.com = ]=20 Sent: 22 January 2016 21:55 To: user@hive.apache.org =20 Subject: Hive Bucketing =20 Hi All, I have enabled bucketing in table. I created 256 buckets on user id. Now = when I am querying (select count(*) from table where userid = =3D172839393) that table, map reduce should only use single partitioned = file as input to mappers. But its considering all files as input to = mapper and I don't see any performance benefit when I run same query in = unbucketed table.=20 Do I have to set any property before running queries on bucketed tables. = I tried join query also, but no performance improvement. In fact, I = think it's taking few more seconds as compared to unbucketed table.=20 Thanks, AJ ------=_NextPart_000_03EA_01D157B4.580CD1E0 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable

Hi,

 

Can you do

 

show create table = <table_name>

 

And send the output please.

 

Thanks

 

Dr Mich = Talebzadeh

 

LinkedIn = =C2=A0https://www.linkedin.com/profile/view?id=3DAAEAAAAWh2g= BxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 = Gold Medal Award 2008

A= Winning Strategy: Running the most Critical Financial Data on ASE = 15

http://login.sybase.com/files/Product_Overviews/ASE-Winni= ng-Strategy-091908.pdf

Auth= or of the books "A Practitioner=E2=80=99s Guide to Upgrading to = Sybase ASE 15", ISBN 978-0-9563693-0-7. =

co-a= uthor "Sybase Transact SQL Guidelines Best Practices", ISBN = 978-0-9759693-0-4

Publications due shortly:

Com= plex Event Processing in Heterogeneous Environments, = ISBN: 978-0-9563693-3-8

Oracle and = Sybase, Concepts and Contrasts, ISBN: = 978-0-9563693-1-4, volume one out = shortly

 

http://talebzadehmich.wordp= ress.com

 

NOTE= : The information in this email is proprietary and confidential. This = message is for the designated recipient only, if you are not the = intended recipient, you should destroy it immediately. Any information = in this message shall not be understood as given or endorsed by Peridale = Technology Ltd, its subsidiaries or their employees, unless expressly so = stated. It is the responsibility of the recipient to ensure that this = email is virus free, therefore neither Peridale Technology Ltd, its = subsidiaries nor their employees accept any responsibility.<= o:p>

 

From:<= /b> = =E8=B0=AD=E6=88=90=E7=81=B6 = [mailto:tanxinz@live.cn]
Sent: 25 January 2016 = 15:37
To: user@hive.apache.org
Subject:
=E7=AD=94=E5=A4=8D: Hive = Bucketing

 

Hi,how to = efficient insert into an orc bucket  table,I found it too = slow.thanks you


=E5=8F=91=E4=BB=B6=E4=BA=BA: = Mich = Talebzadeh
=E5=8F=91=E9=80=81=E6=97=B6=E9=97=B4<= span style=3D'font-size:11.0pt;font-family:"Calibri",sans-serif'>: = =E2=80=8E2016= /=E2=80=8E1/=E2=80=8E23 7:31
=E6=94=B6=E4=BB=B6=E4=BA=BA: = user@hive.apache.org
<= b>=E4=B8=BB
=E9=A2=98: = RE: Hive = Bucketing

Hi,=

 

In general my = understanding is that it will be possible to use bucket pruning much = like partition pruning (elimination) soon

 

Bucketing in = Hive refers to hash partitioning where a hashing function is applied. = Likewise an RDBMS like Oracle, Hive will apply a linear hashing = algorithm to prevent data from clustering within specific partitions. = Hashing is very effective if the column selected for bucketing has very = high selectivity like an ID column where selectivity (select = count(distinct(column))/count(column) ) =3D 1.  In this case, the = created partitions/ files will be as evenly sized as possible. In a = nutshell bucketing is a method to get data evenly distributed over many = partitions/files.  One should define the number of buckets by a = power of two -- 2^n,  like 2, 4, 8, 16 etc to achieve best results. = Again bucketing will help concurrency in Hive. It may even allow a = partition wise join i.e. a join between two tables that are bucketed on = the same column with the same number of buckets (anyone has tried = this?)

 

One more = things. When one defines the number of buckets at table creation level = in Hive, the number of partitions/files will be fixed. In contrast, with = partitioning you do not have this limitation.

.

Have you = considered creating these tables as ORC tables?

 

HTH=

 

Dr Mich = Talebzadeh

 

LinkedIn =  https://www.linkedin.com/profile/view?id=3DAAEAAAAWh2g= BxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 = Gold Medal Award 2008

A= Winning Strategy: Running the most Critical Financial Data on ASE = 15

http://login.sybase.com/files/Product_Overviews/ASE-Winni= ng-Strategy-091908.pdf

Auth= or of the books "A Practitioner=E2=80=99s Guide to Upgrading to = Sybase ASE 15", ISBN 978-0-9563693-0-7. =

co-a= uthor "Sybase Transact SQL Guidelines Best Practices", ISBN = 978-0-9759693-0-4

Publications due shortly:

Com= plex Event Processing in Heterogeneous Environments, = ISBN: 978-0-9563693-3-8

Oracle and = Sybase, Concepts and Contrasts, ISBN: = 978-0-9563693-1-4, volume one out = shortly

&nb= sp;

http://talebzadehmich.wordp= ress.com

 

NOTE= : The information in this email is proprietary and confidential. This = message is for the designated recipient only, if you are not the = intended recipient, you should destroy it immediately. Any information = in this message shall not be understood as given or endorsed by Peridale = Technology Ltd, its subsidiaries or their employees, unless expressly so = stated. It is the responsibility of the recipient to ensure that this = email is virus free, therefore neither Peridale Technology Ltd, its = subsidiaries nor their employees accept any = responsibility.

 

From:<= /b> Akansha = Jain [mailto:akansha.15aug85@gmail.co= m]
Sent: 22 January 2016 23:20
To: user@hive.apache.org
Subje= ct: RE: Hive Bucketing

 

Thanks for response. I am = using 0.13 mapr version. Could you tell more about bucket pruning. =

On Jan 22, 2016 3:09 PM, = "Mich Talebzadeh" <mich@peridale.co.uk> = wrote:

Ok we are = talking about bucket pruning here

 

What version = of Hive are using?

 

Bucket pruning I = believe is available from version 2.0

 

HTH

 

Dr Mich = Talebzadeh

 

LinkedIn =  https://www.linkedin.com/profile/view?id=3DAAEAAAAWh2gB= xianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 = Gold Medal Award 2008

A= Winning Strategy: Running the most Critical Financial Data on ASE = 15

http://login.sybase.com/files/Product_Overviews/ASE-Win= ning-Strategy-091908.pdf

Auth= or of the books "A Practitioner=E2=80=99s Guide to Upgrading to = Sybase ASE 15", ISBN 978-0-9563693-0-7. =

co-a= uthor "Sybase Transact SQL Guidelines Best Practices", ISBN = 978-0-9759693-0-4

Publications due shortly:

Com= plex Event Processing in Heterogeneous Environments, = ISBN: 978-0-9563693-3-8

Oracle and = Sybase, Concepts and Contrasts, ISBN: = 978-0-9563693-1-4, volume one out = shortly

&nb= sp;

http://talebzadehmich.wordpress.com

 

NOTE= : The information in this email is proprietary and confidential. This = message is for the designated recipient only, if you are not the = intended recipient, you should destroy it immediately. Any information = in this message shall not be understood as given or endorsed by Peridale = Technology Ltd, its subsidiaries or their employees, unless expressly so = stated. It is the responsibility of the recipient to ensure that this = email is virus free, therefore neither Peridale Technology Ltd, its = subsidiaries nor their employees accept any = responsibility.

 

From:<= /b> Akansha = Jain [mailto:akansha.15aug85@gmail.com]
Sent: 22 = January 2016 21:55
To: user@hive.apache.org
Subject: Hive = Bucketing

 

Hi All,
I have enabled = bucketing in table. I created 256 buckets on user id. Now when I am = querying (select count(*) from table where userid =3D172839393)  = that table, map reduce should only use single partitioned file as input = to mappers. But its considering all files as input to mapper and I don't = see any performance benefit when I run same query in unbucketed table. =

Do I have to set any property before running queries = on bucketed tables. I tried join query also, but no performance = improvement. In fact, I think it's taking few more seconds as compared = to unbucketed table. =

Thanks,
AJ

------=_NextPart_000_03EA_01D157B4.580CD1E0--