hive-user mailing list archives

From Loudongfeng <>
Subject RE: Reduce number of Hadoop mappers for large number of GZ files
Date Tue, 05 Apr 2016 06:32:29 GMT
You can set hive.hadoop.supports.splittable.combineinputformat=true to combine your files.
In fact, this parameter should have been true by default ever since MAPREDUCE-1597 was fixed
in Hadoop 0.22.0, long ago.
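As a minimal sketch, the setting can be applied per session before running the query (it could also be placed in hive-site.xml):

```sql
-- Let CombineHiveInputFormat group non-splittable (e.g. gzip) files,
-- so a single mapper can read several whole .gz files in one split.
SET hive.hadoop.supports.splittable.combineinputformat=true;
```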

From: Harshit Sharan []
Sent: Saturday, April 02, 2016 4:06 PM
Subject: Reduce number of Hadoop mappers for large number of GZ files


I have a use case where I am building a Hive table over 3072 gz files. Now, whenever
I run a query over this table, it spawns 3072 mappers and takes around 44 minutes to
complete. Earlier, the same data (i.e. the same total size) was stored in 384 files, and the
same queries took only around 9 minutes.

I searched the web and found that the number of mappers is determined by the number of
"splits" of the input data. Hence, setting the parameters mapreduce.input.fileinputformat.split.minsize
and mapreduce.input.fileinputformat.split.maxsize

to a high value like 64 MB should cause each mapper to take up 64 MB worth of data, even if
that means the same mapper processes multiple files.
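For reference, setting those properties per session would look something like this (64 MB = 67108864 bytes; the values are illustrative):

```sql
-- Ask Hadoop for ~64 MB input splits (illustrative values).
SET mapreduce.input.fileinputformat.split.minsize=67108864;
SET mapreduce.input.fileinputformat.split.maxsize=67108864;
```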

But, this solution doesn't work in my case, since gz files are in a "non-splittable" format:
they can neither be split across mappers nor combined for processing by a single mapper.

Has anyone faced this problem too?

There can be various workarounds, like uncompressing the gz files and then using the above
parameters to get fewer mappers, or using higher-end EC2 instances to reduce processing
time. But is there a built-in solution in Hadoop/Hive/EMR to tackle this?
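One way to sketch the uncompress-and-recompress workaround is to convert the files to a splittable codec such as bzip2, so Hadoop can split each output file across mappers (the directory names here are hypothetical, and in practice this would run against S3/HDFS rather than a local disk):

```shell
# Recompress gzip inputs to bzip2, a splittable codec, so that each
# output file can be split across multiple mappers.
# "input" and "recompressed" are hypothetical local directories.
mkdir -p input recompressed
for f in input/*.gz; do
  [ -e "$f" ] || continue          # skip cleanly if no .gz files exist
  out="recompressed/$(basename "${f%.gz}").bz2"
  gunzip -c "$f" | bzip2 > "$out"
done
```

Note that bzip2 trades slower compression for splittability; whether that beats simply storing uncompressed data depends on the cluster and storage costs.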

Thanks in advance for any help!
Harshit Sharan
Software Development Engineer