hive-user mailing list archives

From Matthias Scherer <matthias.sche...@1und1.de>
Subject Merge of compressed RCFile leads to uneven file sizes
Date Tue, 02 Dec 2014 13:16:28 GMT
Hi All,

I am trying to merge gzip-compressed RCFile output into a single file per partition. The Hive
version is 0.10, and these are my settings:

SET hive.exec.compress.intermediate=true;
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET mapred.output.compression.type=BLOCK;

SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.size.per.task=256000000;
SET hive.merge.smallfiles.avgsize=256000000;

After adding another partition with "INSERT OVERWRITE TABLE ... PARTITION (...) SELECT ...",
the output of the Hive job (one MapReduce job plus one map-only merge job) looks like this:

000000_0             file         8.15 MB
000001_0             file         7.88 MB
000002_0             file         5.2 MB
...
000013_0             file         700.56 KB
000014_0             file         574.59 KB

Why is the largest file more than 10 times larger than the smallest? Why are the files ordered
by descending size? And why is the result not a single file?
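
For reference, the size ratio can be checked with a quick calculation. This is only a sketch:
it uses the two extreme sizes from the listing above, assumes the file browser reports binary
units (1 MB = 1024 KB), and ignores the files hidden behind the "..." line. The helper name
to_kb is made up for illustration.

```python
# Sizes copied from the listing above; files elided by "..." are not included.
KB_PER_MB = 1024  # assumption: the file browser uses binary units


def to_kb(value, unit):
    """Convert a (value, unit) size into kilobytes."""
    return value * KB_PER_MB if unit == "MB" else value


largest_kb = to_kb(8.15, "MB")     # 000000_0
smallest_kb = to_kb(574.59, "KB")  # 000014_0

# Ratio between the largest and the smallest merged file.
ratio = largest_kb / smallest_kb
print(f"largest/smallest = {ratio:.1f}")
```

Even with decimal units (1 MB = 1000 KB) the ratio stays above 14, so the imbalance does not
depend on the unit convention.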

I also tested the same table and statement with STORED AS SEQUENCEFILE, and the result was
a single output file.

Regards
Matthias
