From: Edward Capriolo
Date: Tue, 20 Jun 2017 14:42:02 -0400
Subject: Re: Format dillema
To: "user@hive.apache.org"

"Hive and LLAP do support Parquet precisely because the developers want
to be able to process everyone's data."

Yes. But there are a number of optimizations on the Hive ORC side that we
know are not implemented in the Parquet support, which is why I made my
statement: Impala (Parquet=yes, ORC=no), Hive (ORC=yes, Parquet=lame).
E.g.:

https://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/

This requires a reader that is smart enough to understand the predicates.
Fortunately ORC has had the corresponding improvements to allow predicates
to be pushed into it, and takes advantage of its inline indexes to deliver
performance benefits.

I.e., universal improvements won't happen.

"Part of having a thriving ecosystem is that there are competitors, which
creates some user confusion, but makes the ecosystem stronger."

True in many cases. But the fork-happy not-invented-here-ness is too much.
To the average user:

1) both do the same thing.
2) each vendor has some white paper or PowerPoint selling you on why their
   solution is naturally better/smaller/faster.

As it relates to the columnar formats, it is a silly arms race. Parquet had
C/C++ right off the bat, of course, because Impala has to work in C/C++.
But hey, maybe 2.3 years later someone has a GitHub project that does that
for ORC, and maybe 3.2 years later someone adds predicate pushdown for
Parquet in Hive.

In the meantime, actual users are stuck in the middle. They either:

1) use text files anyway, because it is the ONLY format all tools support, or
2) make two outputs for each query, using 2x the space.

"(Can someone please make a competitor for Oozie? *grin*)"

https://github.com/apache/incubator-airflow, mrjob, Luigi, Azkaban :)

On Tue, Jun 20, 2017 at 1:45 PM, Owen O'Malley wrote:

> On Tue, Jun 20, 2017 at 10:12 AM, Edward Capriolo wrote:
>
>> It is whack that two optimized row columnar formats exists and each
>> respective project (hive/impala) has good support for one and lame/no
>> support for the other.
>
> We have two similar formats because they were designed at roughly the same
> time by different teams with similar, but not identical goals. Part of
> having a thriving ecosystem is that there are competitors, which creates
> some user confusion, but makes the ecosystem stronger. (Can someone please
> make a competitor for Oozie? *grin*)
>
> Hive and LLAP do support Parquet precisely because the developers want to
> be able to process everyone's data. The Impala project is free to make
> their own choices about what to work on.
>
> .. Owen