hive-user mailing list archives

From Gopal Vijayaraghavan <gop...@apache.org>
Subject Re: Merging small files
Date Fri, 16 Oct 2015 16:19:22 GMT

> Is there a more efficient way to have Hive merge small files on the
>files without running with two passes?

Not exactly a more efficient way, but adding a shuffle stage usually works
much better, as it gives you the ability to lay out the files for better
vectorization.

Like for TPC-H, doing ETL with

create table lineitem as select * from lineitem sort by l_shipdate,
l_suppkey;

will produce fewer files (exactly as many as you have reducers) and compress
better due to the natural order of transactions (saves ~20 GB or so at scale
factor 1000).

Caveat: that is not more efficient in MRv2, only in Tez/Spark, which can
run MRR (map-reduce-reduce) pipelines as-is.
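
A rough sketch of how the pieces above might fit together in one session,
not a tested recipe: pick an engine that can run the extra reduce stage,
fix the reducer count to control how many files come out, then run the
sorted CTAS. The target name lineitem_sorted and the value 64 are
assumptions for illustration (a CTAS target must not already exist, so the
sketch uses a name distinct from the source table). mapred.reduce.tasks is
the classic property name; newer Hadoop releases call it
mapreduce.job.reduces.

```sql
-- Assumption: Tez is installed on the cluster.
set hive.execution.engine=tez;

-- Assumption: 64 is an illustrative value; the sort stage should then
-- write roughly one file per reducer.
set mapred.reduce.tasks=64;

-- lineitem_sorted is a hypothetical target table name.
create table lineitem_sorted as
select * from lineitem
sort by l_shipdate, l_suppkey;
```

Leaving the reducer count unset and tuning
hive.exec.reducers.bytes.per.reducer instead should let Hive estimate the
count from the input size, which may be preferable when the data volume
varies between loads.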

Cheers,
Gopal


