hive-user mailing list archives

From Mohit Gupta <>
Subject CombineHiveInputFormat and Merge files not working for compressed text files
Date Wed, 30 Nov 2011 07:18:21 GMT
Hi All,
I am using Hive 0.7 on Amazon EMR. I need to merge a large number of small
files into a few larger files (basically merging several partitions of a
table into one). Running the obvious query (insert into a new partition
select * from all partitions) generates a large number of small files in
the new partition: it is a map-only job, so the number of output files
equals the number of mappers.

Note: The table being processed here is stored in compressed format on S3, with these settings:
set hive.exec.compress.output = true;
set mapred.output.compression.codec =;
set io.seqfile.compression.type = BLOCK;

I found a couple of solutions on the net, but sadly neither of them works for me:
1. Merging small files
I set the following parameters:
set hive.merge.mapfiles=true;
set hive.merge.size.per.task=256000000;
set hive.merge.smallfiles.avgsize=100000000;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=1000000000;
set hive.merge.size.smallfiles.avgsize=1000000000;

I expected a reduce job to follow the map-only job and merge the small
output files into a few larger ones, but no reduce job was launched.

2. Using CombineHiveInputFormat
Parameters Set:
set mapred.min.split.size.per.node=1000000000;
set mapred.min.split.size.per.rack=1000000000;
set mapred.max.split.size=1000000000;

Ideally, the number of mappers created here should have been considerably
smaller than the number of input files, producing a correspondingly small
number of output files (one per mapper). But I got the same number of
mappers as input files.

Approximate size of each small file: 125 KB
Number of small files: more than 6,000
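
As a rough sanity check on these numbers, here is a minimal Python sketch of the arithmetic behind both expectations. It is only an estimate under stated assumptions: it takes exactly 6,000 files of 125,000 bytes each, packs them purely by size (ignoring node/rack locality and anything else the input format may consider), and compares the average file size against the first hive.merge.smallfiles.avgsize value set above. The function names are hypothetical and are not part of Hive or Hadoop.

```python
import math

# Figures quoted in this message (approximate).
NUM_FILES = 6000               # "more than 6,000" files; 6000 used as a lower bound
FILE_SIZE_BYTES = 125 * 1000   # ~125 KB per file (decimal KB assumed)

# Values taken from the settings listed in this message.
MAX_SPLIT_SIZE = 1_000_000_000     # mapred.max.split.size
MERGE_AVG_SIZE = 100_000_000       # hive.merge.smallfiles.avgsize (first value set)


def estimated_combined_splits(num_files, file_size, max_split_size):
    """Estimate the number of splits if files were packed purely by size.

    Assumption: locality and per-file constraints are ignored, so a real
    job may create more splits than this estimate.
    """
    total_bytes = num_files * file_size
    return max(1, math.ceil(total_bytes / max_split_size))


def below_merge_threshold(avg_file_size, threshold):
    """Return True when the average file is smaller than the merge threshold."""
    return avg_file_size < threshold


total_bytes = NUM_FILES * FILE_SIZE_BYTES
splits = estimated_combined_splits(NUM_FILES, FILE_SIZE_BYTES, MAX_SPLIT_SIZE)
needs_merge = below_merge_threshold(FILE_SIZE_BYTES, MERGE_AVG_SIZE)

print(f"total input: {total_bytes} bytes")
print(f"estimated combined splits: {splits}")
print(f"average size below merge threshold: {needs_merge}")
```

Under these assumptions the whole input is smaller than one maximum-size split, and the average file is far below the merge threshold, which is why I expected only a handful of mappers or a follow-up merge step rather than one mapper per file.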

I found a couple of links saying that this merging did not work for
compressed files, but that it has since been fixed.
Any ideas on how I can fix this?

Thanks in advance.

Best Regards,

Mohit Gupta
Software Engineer at Vdopia Inc.
