From: Gopal Vijayaraghavan
To: user@hive.apache.org
Date: Fri, 23 Jun 2017 11:53:47 -0700
Subject: Re: Format dillema

> It is not that simple. The average Hadoop user has 6-7 years of data. They do not have a "magic" convert-everything button. They also have legacy processes that don't/can't be converted.
…
> They do not want the "fastest format", they want "the fastest Hive for their data".

I've yet to run into that sort of Luddite - maybe engineers can hold on to an opinion like that in isolation, but businesses are generally cost-sensitive when it comes to storage & compute.

The cynic in me says that if there are a few more down rounds, ORC adoption will suddenly skyrocket in companies which hoard data.

ORC has massive compression advantages over Text, especially for attribute + metric SQL data, so a closer look is warranted.

Some of this stuff literally blows my mind - customer_demographics in TPC-DS is a great example of doing the impossible.

tpcds_bin_partitioned_orc_1000.customer_demographics [numFiles=1, numRows=1920800, totalSize=46194, rawDataSize=726062400]

which works out to 0.19 bits per row (46194 bytes x 8 / 1920800 rows) - not bytes, *BITS*. Compare to Parquet, which is still far better than Text:

tpcds_bin_partitioned_parq_1000.customer_demographics [numFiles=1, numRows=1920800, totalSize=16813614, rawDataSize=17287200]

which comes to ~70 bits per row.

So as companies "age" into their data over the years, they tend to be very receptive to the idea of switching their years-old data to ORC (and then using tiered HDFS etc.).

Still no magic button, but apparently money is a strong incentive to solve hard problems.
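The conversion itself is usually just a CTAS into an ORC table - a rough sketch with made-up table names, assuming the source is a plain TEXTFILE table:

  CREATE TABLE sales_orc
    STORED AS ORC
    TBLPROPERTIES ('orc.compress'='ZLIB')
  AS SELECT * FROM sales_text;

CTAS won't carry partitioning over, so for partitioned tables create the ORC table with the same partition spec first and INSERT OVERWRITE it from the text table, partition by partition.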
> They get data dumps from potentially non-sophisticated partners, maybe using S3 and CSV, because maybe their partner uses Vertica or Redshift. I think you understand this.

That is something I'm painfully aware of - after the first few months, the second request is "Can you do Change-Data-Capture, so that we can reload every 30 mins? Can we do it every 5 minutes?".

And that's why Hive ACID has got SQL MERGE statements, so that you can grab a changelog and apply it with an UPSERT/UPDATE LATEST (there's a rough sketch of the MERGE pattern in the PS below). And unlike the old lock manager, writes don't lock out any readers.

Then as the warehouse gets bigger: "Can you prevent the UPSERT from thrashing my cache & IO? Because the more data I have in the warehouse, the longer the update takes."

And that's what the min-max SemiJoin reduction in Tez does (i.e. the min/max from the changelog gets pushed into the ORC index on the target table scan, so that only the intersection is loaded into cache). We gather a runtime range from the updates and push it to the ACID base, so that we don't have to read data into memory that doesn't have any updates.

Also, if you have a sequential primary key on the OLTP side, this turns into a ~100x speed-up for such a use-case … because ACID ORC has transaction-consistent indexes built in.

> Suppose you have 100 GB of text data in an S3 bucket, and say querying it takes, let's just say, "50 seconds for a group-by type query".
…
> Now that second copy... Maybe I can do the same group by in 30 seconds.

You're off by a couple of orders of magnitude - in fact, that was my last year's Hadoop Summit demo: 10 terabytes of Text on S3, converted to ORC + LLAP.

http://people.apache.org/~gopalv/LLAP-S3.gif (GIANT 38 MB GIF)

That's doing nearly a billion rows a second across 9 nodes, through a join + group-by - a year ago. You can probably hit 720M rows/sec with plain Text with the latest LLAP on the same cluster today.

And with LLAP, adding S3 SSE (encrypted data on S3) adds only a ~4% overhead for ORC, which is another neat trick. And with S3Guard, we have the potential to get the consistency needed for ACID.

The format improvements are foundational to cost-effectiveness in the cloud - you can see the impact of the format on IO costs when you use a non-Hive engine like AWS Athena with ORC and Parquet [1].

> 1) io bound
> 2) have 10 seconds of startup time anyway.

LLAP is memory-bandwidth bound today, because the IO costs are so low and are hidden by async IO - the slowest part of LLAP for a BI query is the time it takes to convert the cache structures into vectorized rows.

Part of that is because ORC is really tightly compressed and the decode loops need to get more instructions per clock. Some parts of ORC decompression can be faster than a raw memcpy of the original data, because of cache access patterns (rather, writing sequentially to the same buffers again is faster than writing to a new location). If the focus on sequential writes makes you think of disks, that is why memory is treated as the new disk.

The startup time for something like Tableau is actually closer to 240ms (and that can come down to 90ms if we disable failure tolerance).

We've got sub-second SQL execution, sub-second compiles, sub-second submissions … with all of it adding up to single- or double-digit seconds over a billion rows of data.

Cheers,
Gopal

[1] - http://tech.marksblogg.com/billion-nyc-taxi-rides-aws-athena.html
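PS: the MERGE pattern above looks roughly like this - a sketch with invented table/column names, assuming the changelog is staged into a table carrying an op flag and the target is an ACID (transactional) table:

  MERGE INTO customers AS t
  USING customer_changes AS s
  ON t.id = s.id
  WHEN MATCHED AND s.op = 'D' THEN DELETE
  WHEN MATCHED THEN UPDATE SET name = s.name, email = s.email
  WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.name, s.email);

The min-max SemiJoin reduction then only has to touch the parts of the target whose ORC index ranges overlap the ids coming in from the changelog.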