hive-user mailing list archives

From Daniel Haviv <daniel.ha...@veracity-group.com>
Subject Re: Merging small files
Date Sat, 17 Oct 2015 15:27:07 GMT
I changed the query to use sort by instead of order by.
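
For reference, a sketch of what the revised statement presumably looks like, assuming only the order by clause was swapped for sort by and everything else in the quoted query below is unchanged:

```sql
-- Assumption: identical to the quoted query, with ORDER BY replaced by SORT BY.
-- ORDER BY imposes a total order, which Hive satisfies with a single reducer;
-- SORT BY only orders rows within each reducer, so the shuffle can run in parallel.
insert into upstreamparam_org partition(day_ts, cmtsid)
select * from upstreamparam_20151013
sort by datats, macaddress;
```

With sort by, each output file is sorted internally but the result is not globally ordered, which should be enough for the file-layout benefit Gopal describes.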


On Sat, Oct 17, 2015 at 6:05 PM, Daniel Haviv <
daniel.haviv@veracity-group.com> wrote:

> Thanks for the tip Gopal.
> I tried what you suggested (on Tez) but I'm getting a middle stage with 1
> reducer (which is awful for performance).
>
> This is my query:
> insert into upstreamparam_org partition(day_ts, cmtsid) select * from
> upstreamparam_20151013 order by datats,macaddress;
>
> I've attached the query plan in case it might help understand why.
>
> Thank you.
> Daniel.
>
>
>
>
> On Fri, Oct 16, 2015 at 7:19 PM, Gopal Vijayaraghavan <gopalv@apache.org>
> wrote:
>
>>
>> > Is there a more efficient way to have Hive merge small files on the
>> >files without running with two passes?
>>
>> Not entirely an efficient way, but adding a shuffle stage usually works
>> much better as it gives you the ability to lay out the files for better
>> vectorization.
>>
>> Like for TPC-H, doing ETL with
>>
>> create table lineitem as select * from lineitem sort by l_shipdate,
>> l_suppkey;
>>
>> will produce fewer files (exactly as many as you have reducers) and
>> compress better due to the natural order of transactions (saves ~20 GB or
>> so at scale factor 1000).
>>
>> Caveat: that is not more efficient in MRv2, only in Tez/Spark which can
>> run MRR pipelines as-is.
>>
>> Cheers,
>> Gopal
>>
>>
>>
>
