From: Anna Lahoud <annalahoud@gmail.com>
To: Raj Vishwanathan <rajvish@yahoo.com>
Cc: user@hadoop.apache.org
Date: Tue, 9 Oct 2012 12:28:14 -0400
Subject: Re: File block size use

You are correct that I want to create a small number of large files from a
large number of small files. The only solution that has worked, as you say,
has been a custom M/R job. Thank you for the help and ideas.
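For anyone who finds this thread in the archives: the working job follows the
approach Chris describes in the quoted thread below (identity map, reducer
that drops the keys). The sketch here is only an illustration, not the exact
code; it assumes Text keys and values in the sequence files and takes the
reducer count as a command-line argument, so adjust both for your own data.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ConsolidateSeqFiles {

  // Reducer that discards the original keys and writes every value under a
  // NullWritable key. Each reduce task produces one larger sequence file.
  public static class NullKeyReducer
      extends Reducer<Text, Text, NullWritable, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      for (Text value : values) {
        context.write(NullWritable.get(), value);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "consolidate sequence files");
    job.setJarByClass(ConsolidateSeqFiles.class);

    // No mapper class set: the default Mapper passes keys and values through
    // unchanged, which is the "identity mapper" part of the approach.
    job.setReducerClass(NullKeyReducer.class);

    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);

    // One output file per reducer; pick this to hit the target file size.
    job.setNumReduceTasks(Integer.parseInt(args[2]));

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Invocation would be something like:
hadoop jar consolidate.jar ConsolidateSeqFiles <input dir> <output dir> <num reducers>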
On Tue, Oct 9, 2012 at 12:09 PM, Raj Vishwanathan wrote:

> Anna
>
> I misunderstood your problem. I thought you wanted to change the block
> size of every file. I didn't realize that you were aggregating multiple
> small files into a different, albeit smaller, set of larger files with a
> bigger block size to improve performance.
>
> I think, as Chris suggested, you need a custom M/R job, or you could
> probably get away with some scripting magic :-)
>
> Raj
>
> ------------------------------
> *From:* Anna Lahoud <annalahoud@gmail.com>
> *To:* user@hadoop.apache.org; Raj Vishwanathan <rajvish@yahoo.com>
> *Sent:* Tuesday, October 9, 2012 7:01 AM
> *Subject:* Re: File block size use
>
> Raj - I was not able to get this to work either.
>
> On Tue, Oct 2, 2012 at 10:52 AM, Raj Vishwanathan wrote:
>
> I haven't tried it, but this should also work:
>
> hadoop fs -Ddfs.block.size=<NEW BLOCK SIZE> -cp src dest
>
> Raj
>
> ------------------------------
> *From:* Anna Lahoud <annalahoud@gmail.com>
> *To:* user@hadoop.apache.org; bejoy.hadoop@gmail.com
> *Sent:* Tuesday, October 2, 2012 7:17 AM
> *Subject:* Re: File block size use
>
> Thank you. I will try today.
>
> On Tue, Oct 2, 2012 at 12:23 AM, Bejoy KS wrote:
>
> Hi Anna
>
> If you want to increase the block size of existing files, you can use an
> Identity Mapper with no reducer. Set the min and max split sizes to your
> requirement (512 MB), and use SequenceFileInputFormat and
> SequenceFileOutputFormat for your job. Your job should be done.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
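(Adding a note inline for the archives: as I read Bejoy's suggestion, the
driver would be configured roughly as in the sketch below. It is only a
sketch and again assumes Text keys and values. One caveat: a plain
FileInputFormat split never spans more than one input file, so with many
small input files this can still leave roughly one output file per input.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class MapOnlyResize {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "map-only sequence file resize");
    job.setJarByClass(MapOnlyResize.class);

    // No mapper or reducer classes set: the default Mapper is the identity,
    // and zero reduce tasks means the map output is written out directly.
    job.setNumReduceTasks(0);

    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    // Ask for roughly 512 MB per split, so each map task (and therefore each
    // output file) covers about that much input.
    long targetSplitSize = 512L * 1024 * 1024;
    FileInputFormat.setMinInputSplitSize(job, targetSplitSize);
    FileInputFormat.setMaxInputSplitSize(job, targetSplitSize);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}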
> ------------------------------
> *From:* Chris Nauroth <cnauroth@hortonworks.com>
> *Date:* Mon, 1 Oct 2012 21:12:58 -0700
> *Reply-To:* user@hadoop.apache.org
> *Subject:* Re: File block size use
>
> Hello Anna,
>
> If I understand correctly, you have a set of multiple sequence files, each
> much smaller than the desired block size, and you want to concatenate them
> into a set of fewer files, each one more closely aligned to your desired
> block size. Presumably, the goal is to improve throughput of map reduce
> jobs using those files as input by running fewer map tasks, each reading a
> larger number of input records.
>
> Whenever I've had this kind of requirement, I've run a custom map reduce
> job to implement the file consolidation. In my case, I was typically
> working with TextInputFormat (not sequence files). I used IdentityMapper
> and a custom reducer that passed through all values but with the key set
> to NullWritable, because the keys (input file offsets in the case of
> TextInputFormat) were not valuable data. For my input data, this was
> sufficient to achieve fairly even distribution of data across the reducer
> tasks, and I could reasonably predict the input data set size, so I could
> reasonably set the number of reducers and get decent results. (This may or
> may not be true for your data set though.)
>
> A weakness of this approach is that the keys must pass from the map tasks
> to the reduce tasks, only to get discarded before writing the final
> output. Also, the distribution of input records to reduce tasks is not
> truly random, and therefore the reduce output files may be uneven in size.
> This could be solved by writing NullWritable keys out of the map task
> instead of the reduce task and writing a custom implementation of
> Partitioner to distribute them randomly.
>
> To expand on this idea, it could be possible to inspect the FileStatus of
> each input, sum the values of FileStatus.getLen(), and then use that
> information to make a decision about how many reducers to run (and
> therefore approximately set a target output file size). I'm not aware of
> any built-in or external utilities that do this for you though.
>
> Hope this helps,
> --Chris
>
> On Mon, Oct 1, 2012 at 11:30 AM, Anna Lahoud wrote:
>
> I would like to be able to resize a set of inputs, already in SequenceFile
> format, to be larger.
>
> I have tried 'hadoop distcp -Ddfs.block.size=$[64*1024*1024]' and did not
> get what I expected. The outputs were exactly the same as the inputs.
>
> I also tried running a job with an IdentityMapper and IdentityReducer.
> Although that comes closer to a solution, it still requires that I know
> in advance how many reducers I need to get better file sizes.
>
> I was looking at the SequenceFile.Writer constructors and noticed that
> there are block size parameters that can be used. Using a writer
> constructed with a 512 MB block size, there is nothing that splits the
> output, and I simply get a single file the size of my inputs.
>
> What is the current standard for combining sequence files to create larger
> files for map-reduce jobs? I have seen code that tracks what it writes
> into the file, but that seems like the long version. I am hoping there is
> a shorter path.
>
> Thank you.
>
> Anna
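(One last note for the archives: Chris's idea above, summing
FileStatus.getLen() over the inputs to pick the reducer count, addresses the
"how many reducers do I need" question from my original message. A rough
sketch of that calculation, with the 512 MB target size and a single flat
input directory as assumptions:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReducerCountEstimate {

  // Sum the sizes of the files in the input directory and divide by the
  // desired output file size to get a reducer count (one output per reducer).
  public static int estimateReducers(Configuration conf, Path inputDir,
                                     long targetBytesPerFile) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    long totalBytes = 0L;
    for (FileStatus status : fs.listStatus(inputDir)) {
      if (!status.isDir()) {
        totalBytes += status.getLen();
      }
    }
    long reducers = (totalBytes + targetBytesPerFile - 1) / targetBytesPerFile;
    return (int) Math.max(1, reducers);
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    long target = 512L * 1024 * 1024;  // aim for roughly 512 MB output files
    int reducers = estimateReducers(conf, new Path(args[0]), target);
    System.out.println("numReduceTasks = " + reducers);
    // The value would then be passed to job.setNumReduceTasks(reducers).
  }
}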
