hive-user mailing list archives

From Daniel Haviv <daniel.ha...@veracity-group.com>
Subject Re: Merging small files
Date Sat, 17 Oct 2015 15:05:58 GMT
Thanks for the tip Gopal.
I tried what you suggested (on Tez), but I'm getting an intermediate stage
with a single reducer, which is awful for performance.

This is my query:
insert into upstreamparam_org partition(day_ts, cmtsid) select * from
upstreamparam_20151013 order by datats,macaddress;

I've attached the query plan in case it helps explain why.

Thank you.
Daniel.
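
Note on the query above: it uses "order by", whereas the quoted suggestion
below uses "sort by". In Hive, ORDER BY guarantees a total order over the
whole result, which is done with a single reducer; SORT BY only orders rows
within each reducer, so the reducer count is not forced to 1. A minimal,
untested sketch of the same insert with SORT BY, assuming the table and
column names from the query above and that dynamic partitioning is already
configured for the session:

```sql
-- Sketch only: same source and target tables as the original query.
-- SORT BY orders rows within each reducer rather than globally,
-- so Hive is free to schedule more than one reducer.
insert into upstreamparam_org partition(day_ts, cmtsid)
select * from upstreamparam_20151013
sort by datats, macaddress;
```

With dynamic partitions, each reducer writes one file per partition it
receives rows for, so the file count can exceed the reducer count. Adding
"distribute by day_ts, cmtsid" before the SORT BY clause is one possible
way to route each partition to a single reducer; whether that helps here
depends on how evenly the data is spread across partitions.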




On Fri, Oct 16, 2015 at 7:19 PM, Gopal Vijayaraghavan <gopalv@apache.org>
wrote:

>
> > Is there a more efficient way to have Hive merge small files without
> > running two passes?
>
> Not an entirely efficient way, but adding a shuffle stage usually works
> much better, as it gives you the ability to lay out the files for better
> vectorization.
>
> Like for TPC-H, doing ETL with
>
> create table lineitem as select * from lineitem sort by l_shipdate,
> l_suppkey;
>
> will produce fewer files (exactly as many as your reducer count) and
> compress better due to the natural order of transactions (saves ~20 GB or
> so at scale factor 1000).
>
> Caveat: that is not more efficient in MRv2, only in Tez/Spark which can
> run MRR pipelines as-is.
>
> Cheers,
> Gopal
>
>
>
