From: Raj Vishwanathan <rajvish@yahoo.com>
To: user@hadoop.apache.org
Cc: annalahoud@gmail.com
Date: Tue, 9 Oct 2012 09:09:55 -0700 (PDT)
Subject: Re: File block size use

Anna,

I misunderstood your problem. I thought you wanted to change the block size of every file. I didn't realize that you were aggregating multiple small files into a different, albeit smaller, set of larger files with a bigger block size to improve performance.

I think, as Chris suggested, you need a custom M/R job, or you could probably get away with some scripting magic :-)

Raj

________________________________
From: Anna Lahoud <annalahoud@gmail.com>
To: user@hadoop.apache.org; Raj Vishwanathan <rajvish@yahoo.com>
Sent: Tuesday, October 9, 2012 7:01 AM
Subject: Re: File block size use

Raj - I was not able to get this to work either.

On Tue, Oct 2, 2012 at 10:52 AM, Raj Vishwanathan <rajvish@yahoo.com> wrote:

I haven't tried it, but this should also work:

  hadoop fs -Ddfs.block.size=<NEW BLOCK SIZE> -cp src dest

Raj
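
A rough sketch of the same per-file idea through the FileSystem API (paths are placeholders and the 512 MB figure is only illustrative; like the -cp approach, this changes the block size of the copy but does not merge small files):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    import java.io.InputStream;
    import java.io.OutputStream;

    public class CopyWithBlockSize {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path src = new Path(args[0]);            // existing file
        Path dst = new Path(args[1]);            // copy written with the new block size
        long newBlockSize = 512L * 1024 * 1024;  // illustrative 512 MB

        InputStream in = fs.open(src);
        // create() lets the caller pick the block size of the new file explicitly
        OutputStream out = fs.create(dst, true,
            conf.getInt("io.file.buffer.size", 4096),
            fs.getFileStatus(src).getReplication(),
            newBlockSize);
        IOUtils.copyBytes(in, out, conf, true);  // close=true closes both streams
      }
    }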

________________________________
From: Anna Lahoud <annalahoud@gmail.com>
To: user@hadoop.apache.org; bejoy.hadoop@gmail.com
Sent: Tuesday, October 2, 2012 7:17 AM
Subject: Re: File block size use

Thank you. I will try today.

On Tue, Oct 2, 2012 at 12:23 AM, Bejoy KS <bejoy.hadoop@gmail.com> wrote:

Hi Anna,

If you want to increase the block size of existing files, you can use an identity mapper with no reducer. Set the minimum and maximum split sizes to your requirement (512 MB), and use SequenceFileInputFormat and SequenceFileOutputFormat for your job. Your job should be done.

Regards,
Bejoy KS

Sent from handheld, please excuse typos.
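
A sketch of the kind of job described above, assuming a Hadoop 2.x-style mapreduce API and, purely for illustration, sequence files with Text keys and Text values (adjust the classes to whatever your files actually contain):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class RewriteSequenceFiles {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "rewrite sequence files");
        job.setJarByClass(RewriteSequenceFiles.class);

        // Identity map (the base Mapper passes records through) and no reduce phase
        job.setMapperClass(Mapper.class);
        job.setNumReduceTasks(0);

        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(Text.class);    // match your files' key class
        job.setOutputValueClass(Text.class);  // match your files' value class

        // Ask for ~512 MB splits, per the suggestion above
        FileInputFormat.setMinInputSplitSize(job, 512L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

One caveat: the stock FileInputFormat creates at least one split per input file and does not combine small files into a single split, so a map-only job like this mainly helps when individual input files are already larger than the requested split size.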

________________________________
From: Chris Nauroth <cnauroth@hortonworks.com>
Date: Mon, 1 Oct 2012 21:12:58 -0700
Reply-To: user@hadoop.apache.org
Subject: Re: File block size use

Hello Anna,

If I understand correctly, you have a set of multiple sequence files, each much smaller than the desired block size, and you want to concatenate them into a set of fewer files, each one more closely aligned to your desired block size. Presumably, the goal is to improve throughput of map reduce jobs using those files as input by running fewer map tasks, reading a larger number of input records.

Whenever I've had this kind of requirement, I've run a custom map reduce job to implement the file consolidation. In my case, I was typically working with TextInputFormat (not sequence files). I used IdentityMapper and a custom reducer that passed through all values but with the key set to NullWritable, because the keys (input file offsets in the case of TextInputFormat) were not valuable data. For my input data, this was sufficient to achieve fairly even distribution of data across the reducer tasks, and I could reasonably predict the input data set size, so I could reasonably set the number of reducers and get decent results. (This may or may not be true for your data set though.)
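
In the newer mapreduce API, the reducer described here might look roughly like this (a sketch; the class name is made up, and the stock Mapper serves as the identity map):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Passes every value through but replaces the key (the file offset produced
    // by TextInputFormat) with NullWritable, so the offsets never reach the output.
    public class DiscardKeyReducer
        extends Reducer<LongWritable, Text, NullWritable, Text> {
      @Override
      protected void reduce(LongWritable key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        for (Text value : values) {
          context.write(NullWritable.get(), value);
        }
      }
    }

The driver would set the map output classes to LongWritable and Text, the job output classes to NullWritable and Text, and pick a fixed number of reducers based on the expected input size.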

A weakness of this approach is that the keys must pass from the map tasks to the reduce tasks, only to get discarded before writing the final output. Also, the distribution of input records to reduce tasks is not truly random, and therefore the reduce output files may be uneven in size. This could be solved by writing NullWritable keys out of the map task instead of the reduce task and writing a custom implementation of Partitioner to distribute them randomly.
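
A sketch of that variant (class names are illustrative): the map emits NullWritable keys directly, and a custom Partitioner scatters records across reducers at random:

    import java.io.IOException;
    import java.util.Random;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Map side: drop the file-offset key immediately instead of carrying it
    // through the shuffle.
    public class NullKeyMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        context.write(NullWritable.get(), line);
      }
    }

    // Every record now has the same (null) key, so a hash partitioner would send
    // everything to one reducer; partition randomly instead.
    class RandomPartitioner extends Partitioner<NullWritable, Text> {
      private final Random random = new Random();
      @Override
      public int getPartition(NullWritable key, Text value, int numPartitions) {
        return random.nextInt(numPartitions);
      }
    }

The driver would register the partitioner with job.setPartitionerClass(RandomPartitioner.class); the reduce side can then be the stock identity Reducer.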

To expand on this idea, it could be possible to inspect the FileStatus of each input, sum the values of FileStatus.getLen(), and then use that information to make a decision about how many reducers to run (and therefore approximately set a target output file size). I'm not aware of any built-in or external utilities that do this for you though.
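
For example, the driver could derive the reducer count from the summed input sizes along these lines (the helper name and the 512 MB target are only illustrative):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReducerCountHelper {
      // Sum the sizes of the files directly under inputDir and return how many
      // reducers are needed so that each writes roughly one target-sized file.
      public static int reducersForTargetSize(Configuration conf, Path inputDir,
          long targetBytesPerFile) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        long totalBytes = 0L;
        for (FileStatus status : fs.listStatus(inputDir)) {
          if (!status.isDir()) {
            totalBytes += status.getLen();
          }
        }
        return (int) Math.max(1, (totalBytes + targetBytesPerFile - 1) / targetBytesPerFile);
      }
    }

The job would then call job.setNumReduceTasks(reducersForTargetSize(conf, inputPath, 512L * 1024 * 1024)) before submission.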

Hope this helps,
--Chris

On Mon, Oct 1, 2012 at 11:30 AM, Anna Lahoud <annalahoud@gmail.com> wrote:

I would like to be able to resize a set of inputs, already in SequenceFile format, to be larger.

I have tried 'hadoop distcp -Ddfs.block.size=$[64*1024*1024]' and did not get what I expected. The outputs were exactly the same as the inputs.

I also tried running a job with an IdentityMapper and IdentityReducer. Although that approaches a better solution, it still requires that I know in advance how many reducers I need to get better file sizes.

I was looking at the SequenceFile.Writer constructors and noticed that there are block size parameters that can be used. Using a writer constructed with a 512MB block size, there is nothing that splits the output and I simply get a single file the size of my inputs.

What is the current standard for combining sequence files to create larger files for map-reduce jobs? I have seen code that tracks what it writes into the file, but that seems like the long version. I am hoping there is a shorter path.

Thank you.

Anna
