Return-Path: X-Original-To: apmail-hive-user-archive@www.apache.org Delivered-To: apmail-hive-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 81EFE189EC for ; Tue, 2 Feb 2016 11:16:22 +0000 (UTC) Received: (qmail 20021 invoked by uid 500); 2 Feb 2016 11:16:20 -0000 Delivered-To: apmail-hive-user-archive@hive.apache.org Received: (qmail 19946 invoked by uid 500); 2 Feb 2016 11:16:20 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 19932 invoked by uid 99); 2 Feb 2016 11:16:20 -0000 Received: from Unknown (HELO spamd3-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 02 Feb 2016 11:16:20 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id AD56F1804F0 for ; Tue, 2 Feb 2016 11:16:19 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 3.601 X-Spam-Level: *** X-Spam-Status: No, score=3.601 tagged_above=-999 required=6.31 tests=[HTML_MESSAGE=2.5, KAM_COUK=1.1, URIBL_BLOCKED=0.001] autolearn=disabled Received: from mx1-us-east.apache.org ([10.40.0.8]) by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024) with ESMTP id 5T7V842Q0kC1 for ; Tue, 2 Feb 2016 11:16:07 +0000 (UTC) Received: from sulu.netzoomi.net (sulu.netzoomi.net [83.138.144.103]) by mx1-us-east.apache.org (ASF Mail Server at mx1-us-east.apache.org) with ESMTP id A31AE429BC for ; Tue, 2 Feb 2016 11:16:06 +0000 (UTC) Received: from vulcan.netzoomi.net (unknown [212.100.249.54]) by sulu.netzoomi.net (Postfix) with ESMTP id 6FF466A4E5E for ; Tue, 2 Feb 2016 11:15:58 +0000 (GMT) X-Envelope-From: Received: from w7 (cpc86449-seve24-2-0-cust177.13-3.cable.virginm.net [86.19.59.178]) by vulcan.netzoomi.net (Postfix) with ESMTPA id 3CBF2124830F for ; Tue, 2 Feb 2016 11:15:58 +0000 (GMT) From: "Mich Talebzadeh" To: References: <56AF90BB.7090606@gmail.com> <06ac01d15d1a$63f97240$2bec56c0$@peridale.co.uk> In-Reply-To: Subject: RE: ORC format Date: Tue, 2 Feb 2016 11:16:28 -0000 Message-ID: <06e801d15dab$296ff950$7c4febf0$@peridale.co.uk> MIME-Version: 1.0 Content-Type: multipart/related; boundary="----=_NextPart_000_06E9_01D15DAB.29730690" X-Mailer: Microsoft Outlook 16.0 Thread-Index: AQDrUyYpVdTx2mUBGDAqTd5WJIJvJgIeLxu6As51hn0Bj9aiFAJC/m5NoJ6zrVA= Content-Language: en-gb This is a multipart message in MIME format. ------=_NextPart_000_06E9_01D15DAB.29730690 Content-Type: multipart/alternative; boundary="----=_NextPart_001_06EA_01D15DAB.29730690" ------=_NextPart_001_06EA_01D15DAB.29730690 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Correct :).=20 =20 Lord knows how these spell checkers work sometime! Perish the thought of = demoralising the data. =20 =20 Regards, =20 =20 Dr Mich Talebzadeh =20 LinkedIn = https://www.linkedin.com/profile/view?id=3DAAEAAAAWh2gBxianrbJd6zP6AcPCCd= OABUrV8Pw =20 Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial Data on ASE 15 http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-0919= 08.pdf Author of the books "A Practitioner=E2=80=99s Guide to Upgrading to = Sybase ASE 15", ISBN 978-0-9563693-0-7.=20 co-author "Sybase Transact SQL Guidelines Best Practices", ISBN = 978-0-9759693-0-4 Publications due shortly: Complex Event Processing in Heterogeneous Environments, ISBN: = 978-0-9563693-3-8 Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, = volume one out shortly =20 http://talebzadehmich.wordpress.com = =20 =20 NOTE: The information in this email is proprietary and confidential. = This message is for the designated recipient only, if you are not the = intended recipient, you should destroy it immediately. Any information = in this message shall not be understood as given or endorsed by Peridale = Technology Ltd, its subsidiaries or their employees, unless expressly so = stated. It is the responsibility of the recipient to ensure that this = email is virus free, therefore neither Peridale Technology Ltd, its = subsidiaries nor their employees accept any responsibility. =20 From: Lefty Leverenz [mailto:leftyleverenz@gmail.com]=20 Sent: 02 February 2016 10:26 To: user@hive.apache.org Subject: Re: ORC format =20 Can't resist teasing Mich about this: "Indeed one often demoralises = data taking advantages of massive parallel processing in Hive." =20 Surely he meant denormalizes = . Nobody would want to = demoralise their data -- performance would suffer. ;) -- Lefty =20 =20 On Mon, Feb 1, 2016 at 10:00 AM, Mich Talebzadeh > wrote: Thanks Alan for this explanation. Interesting to see Primary Key in = Hive. =20 =20 Sometimes comparison is made between Hive Storage Index concept in Orc = and Oracle Exadata storage index that also uses the same terminology! =20 It is a bit of a misnomer to call Oracle Exadata indexes a = =E2=80=9Cstorage index=E2=80=9D, since it appears that Exadata stores = data block from tables in the storage index, usually when they are = accessed via a full-table scan. In this context Exadata storage index = is not a =E2=80=9Creal=E2=80=9D index in the sense that the storage = index exists only in RAM, and it must be re-created from scratch when = the Exadata server is bounced. =20 Oracle Exadata and SAP HANA as far as I know force serial scans into = Hardware - with HANA, it is by pushing the bitmaps into the L2 cache on = the chip - Oracle has special processors on SPARC T5 called D???? = that offloads the column bit scan off the CPU and onto = separate specialized HW. As a result, both rely on massive = parallelization.. =20 =20 Orc storage index is neat and different from both Exadata and SAP HANA, = The way I see ORC storage indexes =20 * They are combined Index and statistics.=20 * Each index has statistics of min, max, count, and sum for each = column in the row group of 10,000 rows. * Crucially, it has the location of the start of each row group, = so that the query can jump straight to the beginning of the row group.=20 * The query can do a SARG pushdown that limits which rows are = required for the query and can avoid reading an entire file, or at least = sections of the file which is by and large what a conventional RDBMS = B-tree index does. =20 =20 Cheers, =20 Dr Mich Talebzadeh =20 LinkedIn = https://www.linkedin.com/profile/view?id=3DAAEAAAAWh2gBxianrbJd6zP6AcPCCd= OABUrV8Pw =20 Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial Data on ASE 15 http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-0919= 08.pdf Author of the books "A Practitioner=E2=80=99s Guide to Upgrading to = Sybase ASE 15", ISBN 978-0-9563693-0-7.=20 co-author "Sybase Transact SQL Guidelines Best Practices", ISBN = 978-0-9759693-0-4 Publications due shortly: Complex Event Processing in Heterogeneous Environments, ISBN: = 978-0-9563693-3-8 Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, = volume one out shortly =20 http://talebzadehmich.wordpress.com = =20 =20 NOTE: The information in this email is proprietary and confidential. = This message is for the designated recipient only, if you are not the = intended recipient, you should destroy it immediately. Any information = in this message shall not be understood as given or endorsed by Peridale = Technology Ltd, its subsidiaries or their employees, unless expressly so = stated. It is the responsibility of the recipient to ensure that this = email is virus free, therefore neither Peridale Technology Ltd, its = subsidiaries nor their employees accept any responsibility. =20 From: Alan Gates [mailto:alanfgates@gmail.com = ]=20 Sent: 01 February 2016 17:07 To: user@hive.apache.org =20 Subject: Re: ORC format =20 ORC does not currently expose a primary key to the user, though we have = talked of having it do that. As Mich says the indexing on ORC is = oriented towards statistics that help the optimizer plan the query. = This can be very important in split generation (determining which parts = of the input will be read by which tasks) as well as on the fly input = pruning (deciding not to read a section of the file because the stats = show that no rows in that section will match a predicate). Either of = these can help joins. But as there is not a user visible primary key = there's no ability to rewrite the join as an index based join, which I = think is what you were asking about in your original email. Alan. Philip Lee February 1, 2016 at 7:27 Also, =20 when making ORC from CSV,=20 for indexing every key on each coulmn is made, or a primary on a table = is made ? =20 If keys are made on each column in a table, accessing any column in some = functions like filtering should be faster. =20 --=20 =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D Hae Joon Lee =20 Now, in Germany, M.S. Candidate, Interested in Distributed System, Iterative Processing Dept. of Computer Science, Informatik in German, TUB Technical University of Berlin =20 In Korea, M.S. Candidate, Computer Architecture Laboratory Dept. of Computer Science, KAIST=20 =20 Rm# 4414 CS Dept. KAIST 373-1 Guseong-dong, Yuseong-gu, Daejon, South Korea (305-701)=20 =20 Mobile) 49) 015-251-448-278 in Germany, no cellular in Korea =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D Philip Lee February 1, 2016 at 7:21 Hello, =20 I experiment the performance of some systems between ORC and CSV file. I read about ORC documentation on Hive website, but still curious of = some things. =20 I know ORC format is faster on filtering or reading because it has = indexing. Has it advantage of joining two tables of ORC dataset as well? =20 Could you explain about it in detail? When experimenting, it seems like it has some advantages of joining in = some aspect, but not quite sure what characteristic of ORC make this = happening rather than CSV. =20 Best, Phil =20 =20 ------=_NextPart_001_06EA_01D15DAB.29730690 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable

Correct J.

 

Lord knows how these spell checkers work sometime! Perish = the thought of demoralising the data.

 

 

Regards,

 

 

Dr Mich = Talebzadeh

 

LinkedIn = =C2=A0https://www.linkedin.com/profile/view?id=3DAAEAAAAWh2g= BxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 = Gold Medal Award 2008

A= Winning Strategy: Running the most Critical Financial Data on ASE = 15

http://login.sybase.com/files/Product_Overviews/ASE-Winni= ng-Strategy-091908.pdf

Auth= or of the books "A Practitioner=E2=80=99s Guide to Upgrading to = Sybase ASE 15", ISBN 978-0-9563693-0-7. =

co-a= uthor "Sybase Transact SQL Guidelines Best Practices", ISBN = 978-0-9759693-0-4

Publications due shortly:

Com= plex Event Processing in Heterogeneous Environments, = ISBN: 978-0-9563693-3-8

Oracle and = Sybase, Concepts and Contrasts, ISBN: = 978-0-9563693-1-4, volume one out = shortly

 

http://talebzadehmich.wordp= ress.com

 

NOTE= : The information in this email is proprietary and confidential. This = message is for the designated recipient only, if you are not the = intended recipient, you should destroy it immediately. Any information = in this message shall not be understood as given or endorsed by Peridale = Technology Ltd, its subsidiaries or their employees, unless expressly so = stated. It is the responsibility of the recipient to ensure that this = email is virus free, therefore neither Peridale Technology Ltd, its = subsidiaries nor their employees accept any responsibility.<= o:p>

 

From:<= /b> Lefty = Leverenz [mailto:leftyleverenz@gmail.com]
Sent: 02 February = 2016 10:26
To: user@hive.apache.org
Subject: Re: ORC = format

 

Can't = resist teasing Mich about this:  "Indeed one often demoralises = data taking advantages of massive parallel processing in = Hive."

 

Surely he meant denormalizes.&= nbsp; Nobody would want to demoralise their data -- performance would = suffer.  ;)


-- = Lefty

 

 

On Mon, = Feb 1, 2016 at 10:00 AM, Mich Talebzadeh <mich@peridale.co.uk> = wrote:

Thanks Alan = for this explanation. Interesting to see Primary Key in = Hive.

 

 <= /o:p>

Sometimes comparison is made = between Hive Storage Index concept in Orc and Oracle Exadata =  storage index that also uses the same = terminology!

 <= /o:p>

It is a bit of a misnomer to = call Oracle Exadata indexes a =E2=80=9Cstorage index=E2=80=9D, since it = appears that Exadata stores data block from tables in the storage index, = usually when they are accessed via a full-table scan.  In this = context Exadata storage index is not a =E2=80=9Creal=E2=80=9D index in = the sense that the storage index exists only in RAM, and it must be = re-created from scratch when the Exadata server is = bounced.

 

Ora= cle Exadata  and SAP HANA as far as I know force serial scans into = Hardware - with HANA, it is by pushing the bitmaps into the L2 cache on = the chip - Oracle has special processors on SPARC T5 called D???? = <something> that offloads the column bit scan off the CPU and onto = separate specialized HW.  As a result, both rely on massive = parallelization..

&nb= sp;

&nb= sp;

Orc= storage index is neat and different from both Exadata and SAP = HANA, The way I = see ORC storage indexes

 

=C2=B7        = ; They are = combined Index and statistics.

=C2=B7        = ; Each index = has statistics of min, max, count, and sum for each column in the row = group of 10,000 rows.

=C2=B7        = ; Crucially, = it has the location of the start of each row group, so that the query = can jump straight to the beginning of the row group. =

=C2=B7        = ; The query = can do  a SARG pushdown that limits which rows are required for the = query and can avoid reading an entire file, or at least sections of the = file which is by and large what a conventional RDBMS B-tree index = does.

 

 

Cheers,<= o:p>

 

Dr Mich = Talebzadeh

 

LinkedIn =  https://www.linkedin.com/profile/view?id=3DAAEAAAAWh2gB= xianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 = Gold Medal Award 2008

A= Winning Strategy: Running the most Critical Financial Data on ASE = 15

http://login.sybase.com/files/Product_Overviews/ASE-Win= ning-Strategy-091908.pdf

Auth= or of the books "A Practitioner=E2=80=99s Guide to Upgrading to = Sybase ASE 15", ISBN 978-0-9563693-0-7. =

co-a= uthor "Sybase Transact SQL Guidelines Best Practices", ISBN = 978-0-9759693-0-4

Publications due shortly:

Complex Event = Processing in Heterogeneous Environments, ISBN: = 978-0-9563693-3-8

Oracle and = Sybase, Concepts and Contrasts, ISBN: = 978-0-9563693-1-4, volume one out shortly

 

http://talebzadehmich.wordpress.com

 

NOTE: The = information in this email is proprietary and confidential. This message = is for the designated recipient only, if you are not the intended = recipient, you should destroy it immediately. Any information in this = message shall not be understood as given or endorsed by Peridale = Technology Ltd, its subsidiaries or their employees, unless expressly so = stated. It is the responsibility of the recipient to ensure that this = email is virus free, therefore neither Peridale Technology Ltd, its = subsidiaries nor their employees accept any = responsibility.

 

From:<= /b> Alan Gates = [mailto:alanfgates@gmail.com]
Sent: 01 February = 2016 17:07
To: user@hive.apache.org
Subject: Re: ORC = format

 <= /o:p>

ORC does not = currently expose a primary key to the user, though we have talked of = having it do that.  As Mich says the indexing on ORC is oriented = towards statistics that help the optimizer plan the query.  This = can be very important in split generation (determining which parts of = the input will be read by which tasks) as well as on the fly input = pruning (deciding not to read a section of the file because the stats = show that no rows in that section will match a predicate).  Either = of these can help joins.  But as there is not a user visible = primary key there's no ability to rewrite the join as an index based = join, which I think is what you were asking about in your original = email.

Alan.

February 1, 2016 at = 7:27

Also, =  

when making ORC from = CSV, 

for indexing every key on each = coulmn is made, or a primary on a table is made = ?

 

If keys are made on each column = in a table, accessing any column in some functions like filtering should = be faster.




 

-- =

=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D

Hae = Joon Lee

 

Now, in = Germany,

M.S. = Candidate, Interested in Distributed System, Iterative = Processing

Dept. = of Computer Science, Informatik in German, TUB

Technical University of = Berlin

 

In = Korea,

M.S. = Candidate, Computer Architecture Laboratory

Dept. = of Computer Science, KAIST 

 

Rm# = 4414 CS Dept. KAIST

373-1 = Guseong-dong, Yuseong-gu, Daejon, South Korea = (305-701) 

 

Mobile) = 49) 015-251-448-278 in Germany, no cellular in = Korea

=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D

=

February 1, 2016 at = 7:21

Hello,

 

I experiment the performance of = some systems between ORC and CSV = file.

I read about ORC documentation = on Hive website, but still curious of some = things.

 

I know ORC format is faster on = filtering or reading because it has = indexing.

Has it advantage of joining two = tables of ORC dataset as well?

 

Could you explain about it in = detail?

When experimenting, it seems = like it has some advantages of joining in some aspect, but not quite = sure what characteristic of ORC make this happening rather than = CSV.

 

Best,

Phil

=

 

 

------=_NextPart_001_06EA_01D15DAB.29730690-- ------=_NextPart_000_06E9_01D15DAB.29730690 Content-Type: image/jpeg; name="image001.jpg" Content-Transfer-Encoding: base64 Content-ID: /9j/4AAQSkZJRgABAQEARwBHAAD/2wBDAAEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQECAQEB AQEBAgICAgICAgICAgICAgICAgICAgICAgICAgICAgL/2wBDAQEBAQEBAQICAgICAgICAgICAgIC AgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgL/wAARCAAZABkDAREA AhEBAxEB/8QAGAAAAwEBAAAAAAAAAAAAAAAABgcICQr/xAA0EAABAwMCAgUKBwAAAAAAAAACAQME BQYRABITIQcUMUF2CBUXIjI2N0JRtVRWkZOV0dL/xAAYAQEAAwEAAAAAAAAAAAAAAAADAAEEAv/E ACQRAAICAAQGAwAAAAAAAAAAAAABAhEDMrHREyExM0FxgfDx/9oADAMBAAIRAxEAPwDuEt+gW/UL et6oVC3rfqNQqFv0OfPn1GhUqfOmzZtKZlS5UqZMaNwzNwiJVIl7eXLCaZIGwBl3TY8epPx2+jy2 ZNPjvkwc9uhW8j7nCPhvOsQliYIeS7cvCpp8o50qwrC4v3lsNSDbdmTEhvs2tahxpfV3WnmbbozJ Ew/gwdadbYExVRXKEKoSdvJcaOSqxE7/AAiX0gXx+a69/JSf9alIlste0VzaNpeFrcT9KKymotyi aZ0KRCnzacoE7Kjzn4gi2KqUh3jqDHDHv4mRUfruTWlMzlVUKIVNp9GguEJnAh0+IZjyAiisgyRD nu5azS8miKqjOTVkKqS/psG37fo1Fbabeg25b8eZPeFJBBJSjMG5HjMeyihnaauZwe4OGiju13GA cpOwBeN+U8/IkGbsiS8b7ryogmbzhbyc9REROfZhERO5ETShjPtvpGqTUyLErytS4siSwx5x2tRH 4hPOI0DkjZtaJtFxuVEbIUUiyeNujlBUJGbJN6nM/Cyf2Hf60YgjvKA+NPSP4gT7axpcPtr51YWJ nYn9dnAQWl722p4ot37yzqnlfp6FrqbwawG8/9k= ------=_NextPart_000_06E9_01D15DAB.29730690--