From: Matthias Scherer
To: user@hive.apache.org
Subject: Merge of compressed RCFile leads to uneven file sizes
Date: Tue, 2 Dec 2014 13:16:28 +0000

Hi All,

I am trying to merge gzip-compressed RCFile output into a single file per partition. The Hive version is 0.10, and these are my settings:

SET hive.exec.compress.intermediate=true;
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET mapred.output.compression.type=BLOCK;

SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.size.per.task=256000000;
SET hive.merge.smallfiles.avgsize=256000000;

After adding another partition with "INSERT OVERWRITE TABLE ... PARTITION (...) SELECT ...", the output of the Hive job (1 MapReduce job + 1 map-only merge job) looks like this:

000000_0    file    8.15 MB
000001_0    file    7.88 MB
000002_0    file    5.2 MB
...
000013_0    file    700.56 KB
000014_0    file    574.59 KB

Why is the largest file more than 10 times bigger than the smallest? Why are the files sorted by file size in descending order? And why is the result not a single file?
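
As a quick sanity check on the two observations above (the size ratio and the descending order), here is a minimal Python sketch. It only uses the five rows shown in the listing; the rows hidden behind "..." are not included, and the helper name parse_sizes and the binary-unit assumption (1 MB = 1024 KB) are illustrative choices, not anything taken from Hive.

```python
# Sizes copied from the listing above; rows elided by "..." are left out.
LISTING = """\
000000_0 file 8.15 MB
000001_0 file 7.88 MB
000002_0 file 5.2 MB
000013_0 file 700.56 KB
000014_0 file 574.59 KB
"""

# Assumption: the listing uses binary units (1 MB = 1024 KB).
UNIT_BYTES = {"KB": 1024, "MB": 1024 ** 2}


def parse_sizes(listing):
    """Return (file name, size in bytes) pairs, one per listing row."""
    rows = []
    for line in listing.splitlines():
        name, _kind, value, unit = line.split()
        rows.append((name, float(value) * UNIT_BYTES[unit]))
    return rows


rows = parse_sizes(LISTING)
sizes = [size for _name, size in rows]

# Ratio between the largest and the smallest listed file.
ratio = max(sizes) / min(sizes)

# True if the files appear in non-increasing size order.
is_descending = sizes == sorted(sizes, reverse=True)

print(f"largest/smallest ratio: {ratio:.1f}")
print(f"sorted by size, descending: {is_descending}")
```

Under the binary-unit assumption the ratio comes out at roughly 14.5, and the listed rows are indeed in descending size order.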

 

I also tested the same table and statement with STORED AS SEQUENCEFILE, and the result was a single output file.

Regards

Matthias
