hive-user mailing list archives

From Gopal Vijayaraghavan
Subject Re: Format dillema
Date Tue, 20 Jun 2017 21:05:19 GMT

> 1) both do the same thing. 

This thread started from the exact opposite position - suggesting ORC is better for storage
& wanting to use it.

> As it relates the columnar formats, it is silly arms race. 

I'm not sure "silly" is the operative word - the community has shed a lot of its fragmentation
and we are down to 2 good choices, neither of them wrong.

Impala's original format was Trevni, which lives on in Avro docs. And there was RCFile - a
sequence file format, which stored columnar data in a <K,V> pair. And then there was
LazySimple SequenceFile, LazyBinary SequenceFile, Avro and Text with many SerDes.

Purely speculatively, we're headed into more fragmentation again, with people rediscovering
that they need updates.

Uber's Hoodie is the Parquet fork, but for Spark, not Impala, while ORC ACID is getting much
easier to update with MERGE statements and a deadlock-aware txn manager.

> Parquet had C/C++ right off the bat of course because impala has to work in C/C++.

I think that is the primary reason why the Java Parquet readers are still way behind in performance.

Nobody sane wants to work on performance tuning a data reader library in Java, when it is
so much easier to do it in C++.

Doing C++ after tuning the format for optimal performance in Java8 makes a lot of sense, in
hindsight. The marshmallow test is easier if you can't have a marshmallow now.

> 1) uses text file anyway because it is the ONLY format all tools support

I see this often: folks who just throw plain text into S3 and query it.

Hive 3.x branch has text vectorization and LLAP cache support for it, so hopefully the only
relevant concern about Text will be the storage costs due to poor compression (& the lack
of updates).

