hive-user mailing list archives

From Gopal Vijayaraghavan <>
Subject Re: Format dilemma
Date Fri, 23 Jun 2017 18:53:47 GMT

> It is not that simple. The average Hadoop user has 6-7 years of data. They do not have
a "magic" convert-everything button. They also have legacy processes that don't/can't be converted.

> They do not want the "fastest format"; they want "the fastest Hive for their data".

I've yet to run into that sort of luddite - maybe engineers can hold onto an opinion like
that in isolation, but businesses are in general cost-sensitive when it comes to storage.

The cynic in me says that if there are a few more down rounds, ORC adoption will suddenly
skyrocket in companies which hoard data.

ORC has massive compression advantages over Text, especially for attribute+metric SQL data.
A closer look at this is warranted.

Some of this stuff literally blows my mind - customer_demographics in TPC-DS is a great example
of doing the impossible.

tpcds_bin_partitioned_orc_1000.customer_demographics [numFiles=1, numRows=1920800, totalSize=46194,

which makes it 0.19 bits per row (not bytes, *BITS*).

Compare to Parquet (which is still far better than text)

tpcds_bin_partitioned_parq_1000.customer_demographics  [numFiles=1, numRows=1920800, totalSize=16813614,

which uses about 70 bits per row.
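Those two stats lines make the ratio easy to check - a quick back-of-the-envelope computation in plain Python, using only the numRows/totalSize figures quoted above:

```python
# Bits per row = (file size in bytes * 8) / row count, computed from the
# numRows/totalSize stats quoted above for customer_demographics.
def bits_per_row(total_size_bytes: int, num_rows: int) -> float:
    return total_size_bytes * 8 / num_rows

orc_bits = bits_per_row(46_194, 1_920_800)          # ORC copy of the table
parquet_bits = bits_per_row(16_813_614, 1_920_800)  # Parquet copy, same rows

print(f"ORC:     {orc_bits:.2f} bits/row")      # ~0.19
print(f"Parquet: {parquet_bits:.2f} bits/row")  # ~70.03
```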

So as companies' data ages over the years, they tend to be very receptive to the idea
of switching their years-old data to ORC (and then using tiered HDFS etc.).

Still no magic button, but apparently money is a strong incentive to solve hard problems.

> They get data dumps from potentially unsophisticated partners, maybe using S3 and CSV,
because maybe their partner uses Vertica or Redshift. I think you understand this.

That is something I'm painfully aware of - after the first few months, the second request
is "Can you do Change-Data-Capture, so that we can reload every 30 mins? Can we do it every 5 mins?"

And that's why Hive ACID has got SQL MERGE statements, so that you can grab a ChangeLog and
apply it over with an UPSERT/UPDATE LATEST. And unlike the old lock manager, writes don't
lock out any readers.
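The upsert semantics that MERGE gives you can be sketched in a few lines of plain Python (a toy model with made-up ids and fields, not the Hive implementation): a changelog keyed by primary key is applied over the base, updating matched rows and inserting the rest.

```python
# Toy model of MERGE-style upsert semantics: a changelog is applied over a
# base table keyed by primary key - matched keys are updated, new keys are
# inserted. The ids and fields here are invented for illustration.
base = {1: {"name": "alice"}, 2: {"name": "bob"}}
changelog = [
    {"id": 2, "name": "bobby"},   # WHEN MATCHED   -> UPDATE
    {"id": 3, "name": "carol"},   # WHEN NOT MATCHED -> INSERT
]

for row in changelog:
    base[row["id"]] = {"name": row["name"]}

print(base)
# {1: {'name': 'alice'}, 2: {'name': 'bobby'}, 3: {'name': 'carol'}}
```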

Then as the warehouse gets bigger, "can you prevent the UPSERT from thrashing my cache &
IO? Because the more data I have in the warehouse the longer the update takes."

And that's what the min-max SemiJoin reduction in Tez does (i.e. the min/max from the changelog
gets pushed into the ORC index on the target table scan, so that only the intersection is
loaded into cache). We gather a runtime range from the updates and push it to the ACID base,
so that we don't have to read data into memory that doesn't have any updates.
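That pruning idea can be sketched in plain Python (stripe boundaries and keys below are invented for illustration; the real work happens in the ORC reader and Tez):

```python
# Sketch of min-max semijoin reduction: take the key range covered by the
# changelog, then skip every stripe of the target table whose [min, max]
# index entry does not intersect that range. All numbers here are made up.
changelog_keys = [10_450, 10_470, 10_512]
lo, hi = min(changelog_keys), max(changelog_keys)

# (stripe_id, key_min, key_max) - stand-ins for ORC stripe index entries.
stripes = [(0, 0, 9_999), (1, 10_000, 19_999), (2, 20_000, 29_999)]

stripes_to_read = [sid for sid, smin, smax in stripes
                   if smin <= hi and smax >= lo]
print(stripes_to_read)  # only one stripe intersects the update range
```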

Also, if you have a sequential primary key on the OLTP side, this comes in as a ~100x speed
up for such a use-case … because ACID ORC has transaction-consistent indexes built-in.

> Suppose you have 100 GB of text data in an S3 bucket, and say querying it takes, let's just
say, "50 seconds for a group-by type query".
> Now that second copy... Maybe I can do the same group-by in 30 seconds.

You're off by a couple of orders of magnitude - in fact, that was my Hadoop Summit demo
last year: 10 terabytes of Text on S3, converted to ORC + LLAP. (GIANT 38MB GIF)

That's doing nearly a billion rows a second across 9 nodes, through a join + group-by - a
year ago. You can probably hit 720M rows/sec with plain Text with latest LLAP on the same
cluster today.

And with LLAP, adding S3 SSE (encrypted data on S3) adds a ~4% overhead for ORC, which is
another neat trick. And with S3Guard, we have the potential to get the consistency needed
for ACID.

The format improvements are foundational to the cost-effectiveness on the cloud - you can
see the impact of the format on the IO costs when you use a non-Hive engine like AWS Athena
with ORC and Parquet [1].

> 1) io bound 
> 2) have 10 seconds of startup time anyway. 

LLAP is memory-bandwidth bound today, because the IO costs are so low & are hidden by async
IO - the slowest part of LLAP for a BI query is the amount of time it takes to convert the
cache structures into vectorized rows.

Part of it is due to the fact that ORC is really tightly compressed, so the decode loops need
to get more instructions-per-clock. Some parts of ORC decompression can be faster than a raw
memcpy of the original data, because of cache access patterns (rather, writing sequentially
to the same buffers again is faster than writing to a new location). If the focus on sequential
writes makes you think of disks, this is why memory is treated as the new disk.

The startup time for something like Tableau is actually closer to 240ms (& that can come
down to 90ms if we disable failure tolerance).

We've got sub-second SQL execution, sub-second compiles, sub-second submissions … with all
of it adding up to single- or double-digit seconds over a billion rows of data.

[1] -
