From: Gopal Vijayaraghavan
To: user@hive.apache.org
Date: Fri, 23 Jun 2017 11:53:47 -0700
Subject: Re: Format dillema

> It is not that simple. The average Hadoop user has 6-7 years of data. They do not have a "magic" convert-everything button. They also have legacy processes that don't/can't be converted.
…
> They do not want the "fastest format", they want "the fastest Hive for their data".

I've yet to run into that sort of Luddite - maybe engineers can hold on to an opinion like that in isolation, but businesses are generally cost-sensitive when it comes to storage & compute.

The cynic in me says that if there are a few more down rounds, ORC adoption will suddenly skyrocket in companies which hoard data.

ORC has massive compression advantages over Text, especially for attribute + metric SQL data, so a closer look is warranted.

Some of this stuff literally blows my mind - customer_demographics in TPC-DS is a great example of doing the impossible.

tpcds_bin_partitioned_orc_1000.customer_demographics [numFiles=1, numRows=1920800, totalSize=46194, rawDataSize=726062400]

which works out to 0.19 bits per row (46194 bytes x 8 / 1920800 rows) - not bytes, *BITS*. Compare to Parquet, which is still far better than Text:

tpcds_bin_partitioned_parq_1000.customer_demographics [numFiles=1, numRows=1920800, totalSize=16813614, rawDataSize=17287200]

which comes to ~70 bits per row.

So as companies "age" into their data over the years, they tend to be very receptive to the idea of switching their years-old data to ORC (and then using tiered HDFS etc.).

Still no magic button, but apparently money is a strong incentive to solve hard problems.
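The conversion itself is usually just a CTAS into an ORC table - a rough sketch with made-up table names, assuming the source is a plain TEXTFILE table:

  CREATE TABLE sales_orc
    STORED AS ORC
    TBLPROPERTIES ('orc.compress'='ZLIB')
  AS SELECT * FROM sales_text;

CTAS won't carry partitioning over, so for partitioned tables create the ORC table with the same partition spec first and INSERT OVERWRITE it from the text table, partition by partition.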
> They get data dumps from potentially non-sophisticated partners, maybe using S3 and CSV, because maybe their partner uses Vertica or Redshift. I think you understand this.

That is something I'm painfully aware of - after the first few months, the second request is "Can you do Change-Data-Capture, so that we can reload every 30 mins? Can we do it every 5 minutes?".

And that's why Hive ACID has got SQL MERGE statements, so that you can grab a changelog and apply it with an UPSERT/UPDATE LATEST (there's a rough sketch of the MERGE pattern in the PS below). And unlike the old lock manager, writes don't lock out any readers.

Then as the warehouse gets bigger: "Can you prevent the UPSERT from thrashing my cache & IO? Because the more data I have in the warehouse, the longer the update takes."

And that's what the min-max SemiJoin reduction in Tez does (i.e. the min/max from the changelog gets pushed into the ORC index on the target table scan, so that only the intersection is loaded into cache). We gather a runtime range from the updates and push it to the ACID base, so that we don't have to read data into memory that doesn't have any updates.

Also, if you have a sequential primary key on the OLTP side, this turns into a ~100x speed-up for such a use-case … because ACID ORC has transaction-consistent indexes built in.

> Suppose you have 100 GB of text data in an S3 bucket, and say querying it takes, let's just say, "50 seconds for a group-by type query".
…
> Now that second copy... Maybe I can do the same group by in 30 seconds.

You're off by a couple of orders of magnitude - in fact, that was my last year's Hadoop Summit demo: 10 terabytes of Text on S3, converted to ORC + LLAP.

http://people.apache.org/~gopalv/LLAP-S3.gif (GIANT 38 MB GIF)

That's doing nearly a billion rows a second across 9 nodes, through a join + group-by - a year ago. You can probably hit 720M rows/sec with plain Text with the latest LLAP on the same cluster today.

And with LLAP, adding S3 SSE (encrypted data on S3) adds only a ~4% overhead for ORC, which is another neat trick. And with S3Guard, we have the potential to get the consistency needed for ACID.

The format improvements are foundational to cost-effectiveness in the cloud - you can see the impact of the format on IO costs when you use a non-Hive engine like AWS Athena with ORC and Parquet [1].

> 1) io bound
> 2) have 10 seconds of startup time anyway.

LLAP is memory-bandwidth bound today, because the IO costs are so low and are hidden by async IO - the slowest part of LLAP for a BI query is the time it takes to convert the cache structures into vectorized rows.

Part of that is because ORC is really tightly compressed and the decode loops need to get more instructions per clock. Some parts of ORC decompression can be faster than a raw memcpy of the original data, because of cache access patterns (rather, writing sequentially to the same buffers again is faster than writing to a new location). If the focus on sequential writes makes you think of disks, that is why memory is treated as the new disk.

The startup time for something like Tableau is actually closer to 240ms (and that can come down to 90ms if we disable failure tolerance).

We've got sub-second SQL execution, sub-second compiles, sub-second submissions … with all of it adding up to single- or double-digit seconds over a billion rows of data.

Cheers,
Gopal

[1] - http://tech.marksblogg.com/billion-nyc-taxi-rides-aws-athena.html
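PS: the MERGE pattern above looks roughly like this - a sketch with invented table/column names, assuming the changelog is staged into a table carrying an op flag and the target is an ACID (transactional) table:

  MERGE INTO customers AS t
  USING customer_changes AS s
  ON t.id = s.id
  WHEN MATCHED AND s.op = 'D' THEN DELETE
  WHEN MATCHED THEN UPDATE SET name = s.name, email = s.email
  WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.name, s.email);

The min-max SemiJoin reduction then only has to touch the parts of the target whose ORC index ranges overlap the ids coming in from the changelog.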