Mailing-List: contact user-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hive.apache.org
From: Matthias Scherer <matthias.scherer@1und1.de>
To: "user@hive.apache.org" <user@hive.apache.org>
Subject: Merge of compressed RCFile leads to uneven file sizes
Date: Tue, 2 Dec 2014 13:16:28 +0000
Message-ID: <6CE79C67CF5ED44C8F03F7E91FB68B0B2C4B6686@exbea03.webde.local>
Accept-Language: de-DE, en-US
Content-Language: de-DE
Content-Type: multipart/alternative;
	boundary="_000_6CE79C67CF5ED44C8F03F7E91FB68B0B2C4B6686exbea03webdeloc_"
MIME-Version: 1.0

--_000_6CE79C67CF5ED44C8F03F7E91FB68B0B2C4B6686exbea03webdeloc_
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

Hi All,

I am trying to merge gzip-compressed RCFile output into a single file per
partition. The Hive version is 0.10:

SET hive.exec.compress.intermediate=true;
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET mapred.output.compression.type=BLOCK;

SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.size.per.task=256000000;
SET hive.merge.smallfiles.avgsize=256000000;

After adding another partition with "INSERT OVERWRITE TABLE ... PARTITION
(...) SELECT ...", the output of the Hive job (1 MapReduce job + 1 map-only
merge job) looks like this:

000000_0             file         8.15 MB
000001_0             file         7.88 MB
000002_0             file         5.2 MB
...
000013_0             file         700.56 KB
000014_0             file         574.59 KB

Why is the largest file more than 10 times bigger than the smallest? Why are
the files sorted by file size in descending order? And why is there not just
a single file?

I also tested the same table and statement with STORED AS SEQUENCEFILE, and
the result was a single output file.
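
As a rough sanity check on the numbers above, the following Python sketch
computes the size ratio and an upper bound on the partition size. It assumes
binary units (1 MB = 1024 KB) and that the elided entries are 000003_0 to
000012_0, i.e. 15 files in total, none larger than the first one; neither
assumption is confirmed by the listing itself.

```python
# Sizes copied from the file listing; binary units are an assumption.
KB = 1024
MB = 1024 * KB

largest = 8.15 * MB       # 000000_0
smallest = 574.59 * KB    # 000014_0
ratio = largest / smallest

# Assumption: 15 files (000000_0 .. 000014_0), each no larger than the first,
# since the listing appears in descending size order.
file_count = 15
total_upper_bound = file_count * largest

# Value of hive.merge.size.per.task from the settings, in bytes.
merge_size_per_task = 256_000_000

print(f"largest/smallest ratio: {ratio:.1f}")
print(f"upper bound on partition size: {total_upper_bound / MB:.2f} MB")
print(f"below merge threshold: {total_upper_bound < merge_size_per_task}")
```

Under these assumptions the whole partition stays well below
hive.merge.size.per.task, which is why a single merged file seemed like the
expected result.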

Regards
Matthias

--_000_6CE79C67CF5ED44C8F03F7E91FB68B0B2C4B6686exbea03webdeloc_--