Subject: Re: Merging files
From: Something Something <mailinglists19@gmail.com>
To: user@pig.apache.org, mapreduce-user@hadoop.apache.org
Date: Wed, 31 Jul 2013 09:21:51 -0700

Thanks, John. But I don't see an option to specify the number of output
files. How does Crush decide how many files to create? Is it based only
on file sizes?

On Wed, Jul 31, 2013 at 6:28 AM, John Meagher <john.meagher@gmail.com> wrote:
> Here's a great tool for handling exactly that case:
> https://github.com/edwardcapriolo/filecrush
>
> On Wed, Jul 31, 2013 at 2:40 AM, Something Something
> <mailinglists19@gmail.com> wrote:
> > Each bz2 file after merging is about 50 MB. The reducers take about
> > 9 minutes.
> >
> > Note: 'getmerge' is not an option. There isn't enough disk space to
> > do a getmerge on the local production box. Plus, we need a scalable
> > solution, as these files will get a lot bigger soon.
> >
> > On Tue, Jul 30, 2013 at 10:34 PM, Ben Juhn <benjijuhn@gmail.com> wrote:
> >
> >> How big are your 50 files? How long are the reducers taking?
> >>
> >> On Jul 30, 2013, at 10:26 PM, Something Something
> >> <mailinglists19@gmail.com> wrote:
> >>
> >> > Hello,
> >> >
> >> > One of our Pig scripts creates over 500 small part files. To save
> >> > on namespace, we need to cut down the number of files, so instead
> >> > of saving 500 small files we want to merge them into 50. We tried
> >> > the following:
> >> >
> >> > 1) When we set the parallel number to 50, the Pig script takes a
> >> > long time, for obvious reasons.
> >> > 2) If we use Hadoop Streaming, it puts garbage values into the key
> >> > field.
> >> > 3) We wrote our own MapReduce program that reads these 500 small
> >> > part files and uses 50 reducers. The mappers simply write out each
> >> > line, and the reducers loop through the values and write them out.
> >> > We set job.setOutputKeyClass(NullWritable.class) so that the key is
> >> > not written to the output file. This performs better than Pig: the
> >> > mappers run very fast, and although the reducers take some time to
> >> > complete, the approach seems to be working well.
> >> >
> >> > Is there a better way to do this? What strategy can you think of
> >> > to increase the speed of the reducers?
> >> >
> >> > Any help in this regard will be greatly appreciated. Thanks.
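
For reference, a minimal sketch of the custom merge job described in
point 3 above, assuming plain-text input. The class names, the bzip2
output setting, and the command-line argument handling are illustrative,
not taken from the thread:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeSmallFiles {

    // Emit each line as the map output key; the default hash
    // partitioner then spreads the lines across the 50 reducers.
    public static class LineMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(line, NullWritable.get());
        }
    }

    // Write every occurrence of each line back out; looping over the
    // values preserves duplicate input lines.
    public static class LineReducer
            extends Reducer<Text, NullWritable, NullWritable, Text> {
        @Override
        protected void reduce(Text line, Iterable<NullWritable> values,
                Context ctx) throws IOException, InterruptedException {
            for (NullWritable ignored : values) {
                ctx.write(NullWritable.get(), line);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "merge small part files");
        job.setJarByClass(MergeSmallFiles.class);
        job.setMapperClass(LineMapper.class);
        job.setReducerClass(LineReducer.class);
        job.setNumReduceTasks(50);                 // 500 inputs -> 50 output files
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(NullWritable.class); // key is not written to output
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

One caveat with this keying scheme: using the line itself as the map
output key sorts the merged output and groups duplicate lines together,
so if the original line order matters, a different key (for example a
random integer) would be needed instead.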