From: Ning Zhang
Reply-To: user@hive.apache.org
Subject: Re: Hive produces very small files despite hive.merge...=true settings
Date: Thu, 18 Nov 2010 21:12:52 +0000
Message-ID: <727149DB-5FAA-47F7-A74C-2DCE0E03CFB1@fb.com>

The settings look good. The parameter hive.merge.size.smallfiles.avgsize is used to determine at run time whether a merge should be triggered: if the average size of the files in the partition is SMALLER than the parameter and there is more than one file, the merge should be scheduled. Can you check whether your resulting partition also contains any big files? If a very large file is raising the average above the threshold, you can set the parameter large enough to compensate.

Another possibility is that your Hadoop installation does not support CombineHiveInputFormat, which is used for the new merge job. Someone previously reported that the merge was not successful because of this. If that's the case, you can turn off CombineHiveInputFormat and use the old HiveInputFormat (though slower) by setting hive.mergejob.maponly=false.

Ning

On Nov 17, 2010, at 6:00 PM, Leo Alekseyev wrote:

> I have jobs that sample (or generate) a small amount of data from a
> large table. At the end, I get e.g. about 3000 or more files of 1kb
> or so. This becomes a nuisance. How can I make Hive do another pass
> to merge the output? I have the following settings:
>
> hive.merge.mapfiles=true
> hive.merge.mapredfiles=true
> hive.merge.size.per.task=256000000
> hive.merge.size.smallfiles.avgsize=16000000
>
> After setting hive.merge* to true, Hive started indicating "Total
> MapReduce jobs = 2". However, after generating the
> lots-of-small-files table, Hive says:
> Ended Job = job_201011021934_1344
> Ended Job = 781771542, job is filtered out (removed at runtime).
>
> Is there a way to force the merge, or am I missing something?
> --Leo
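
For readers following the thread, the merge trigger described in the reply can be sketched in a few lines of Python. This is only an illustration of the rule as stated (average file size below hive.merge.size.smallfiles.avgsize, and more than one file), not Hive's actual implementation. The function name `should_merge` and the 60 GB file are hypothetical; the 3000 files of about 1 KB and the 16000000 threshold come from the original question.

```python
def should_merge(file_sizes, avg_size_threshold=16_000_000):
    """Sketch of the rule from the reply, not Hive's real code.

    file_sizes: sizes in bytes of the files in one output partition.
    avg_size_threshold: stands in for hive.merge.size.smallfiles.avgsize.
    """
    # A partition with zero or one file is never merged.
    if len(file_sizes) <= 1:
        return False
    average = sum(file_sizes) / len(file_sizes)
    # The merge is scheduled only when the average is below the threshold.
    return average < avg_size_threshold


# About 3000 files of roughly 1 KB each, as in the original question.
small_files = [1024] * 3000

# Hypothetical: the same partition plus one very large (60 GB) file.
with_large_file = small_files + [60_000_000_000]

merge_small = should_merge(small_files)
merge_skewed = should_merge(with_large_file)
merge_skewed_raised = should_merge(with_large_file, avg_size_threshold=32_000_000)
```

Under these assumptions the small-files-only partition qualifies for a merge, while the single large file lifts the average to roughly 20 MB, above the 16 MB threshold, so the merge is skipped until the threshold is raised.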