Date: Sun, 6 Mar 2016 16:06:25 +0000
From: Mich Talebzadeh
To: user@hive.apache.org
Subject: Re: Parquet versus ORC
In-Reply-To: <56DC4DEE.4090801@sonra.io>
Reply-To: user@hive.apache.org

Hi,

Thanks for that link.

It appears that the main advantage of Parquet is stated as follows, and I quote:

"Parquet is built to be used by anyone. The Hadoop ecosystem is rich with data processing frameworks, and we are not interested in playing favorites. We believe that an efficient, well-implemented columnar storage substrate should be useful to all frameworks without the cost of extensive and difficult to set up dependencies."

Fair enough, Parquet provides a columnar format and compression. As I stated, I do not know much about it. However, my understanding of ORC is that it provides better encoding of data, predicate push down for some predicates, plus support for ACID properties.

As Alan Gates stated before (Hive user forum, "Difference between ORC and RC files", 21 Dec 15), and I quote:

"Whether ORC is the best format for what you're doing depends on the data you're storing and how you are querying it. If you are storing data where you know the schema and you are doing analytic type queries it's the best choice (in fairness, some would dispute this and choose Parquet, though much of what I said above about ORC vs RC applies to Parquet as well). If you are doing queries that select the whole row each time, columnar formats like ORC won't be your friend. Also, if you are storing self structured data such as JSON or Avro you may find text or Avro storage to be a better format."

So what would be the main advantage(s) of Parquet over ORC, please, besides queries that select the whole row (much like a row-based relational database does)?

Cheers.

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 6 March 2016 at 15:34, Uli Bethke wrote:

> Curious why you think that Parquet does not have metadata at file, row
> group or column level.
> Please refer here to the type of metadata that Parquet supports in the
> docs: http://parquet.apache.org/documentation/latest/
>
> On 06/03/2016 15:26, Mich Talebzadeh wrote:
>
> Hi.
>
> I have been hearing a fair bit about Parquet versus ORC tables.
>
> In a nutshell I can say that Parquet is a predecessor to ORC (both provide
> columnar type storage) but I notice that it is still being used,
> especially by Spark users.
>
> In mitigation it appears that Spark users are reluctant to use ORC despite
> the fact that with inbuilt Store Index it offers superior optimisation with
> data and stats at file, stripe and row group level. Both Parquet and ORC
> offer SNAPPY compression as well. ORC offers ZLIB as default.
>
> There may be reasons other than technical ones for this adoption, for
> example too much reliance on Hive plus the fact that it is easier to
> flatten Parquet than ORC (whatever that means).
>
> I myself use either text files or ORC with Hive and Spark and don't
> really see any reason why I should adopt others like Avro, Parquet etc.
>
> Appreciate any verification or experience on this.
>
> Thanks,
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> --
> ___________________________
> Uli Bethke
> Chair Hadoop User Group Ireland
> www.hugireland.org
> HUG Ireland is community sponsor of Hadoop Summit Europe in Dublin
> http://2016.hadoopsummit.org/dublin/
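
For readers unfamiliar with how the "stats at file, stripe and row group level" mentioned in this thread enable predicate push down, here is a toy sketch in Python. It is not ORC or Parquet code and uses neither library: the Stripe class and the scan_greater_than function are hypothetical names made up for illustration, and the only assumption is that each block of rows carries min/max statistics recorded when it was written.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Stripe:
    """Hypothetical stand-in for an ORC stripe or a Parquet row group."""
    values: List[int]
    min_value: int = field(init=False)
    max_value: int = field(init=False)

    def __post_init__(self) -> None:
        # Statistics are computed once, at "write" time, and kept as metadata.
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def scan_greater_than(stripes: List[Stripe], threshold: int) -> Tuple[List[int], int]:
    """Return the values greater than threshold and how many stripes were read."""
    matches: List[int] = []
    stripes_read = 0
    for stripe in stripes:
        # The max statistic proves no row in this stripe can match, so skip it
        # without touching its data.
        if stripe.max_value <= threshold:
            continue
        stripes_read += 1
        matches.extend(v for v in stripe.values if v > threshold)
    return matches, stripes_read


stripes = [Stripe([1, 5, 9]), Stripe([10, 14, 18]), Stripe([20, 25, 30])]
matches, stripes_read = scan_greater_than(stripes, 15)
print(matches, stripes_read)
```

In this toy data set the first block is skipped purely on its metadata, so only two of the three blocks are read. How much real engines gain from this depends on how well the data is sorted or clustered on the filtered column.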