From: Himanish Kushary <himanish@gmail.com>
To: user@hadoop.apache.org
Date: Mon, 1 Apr 2013 08:53:00 -0400
Subject: Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

I was able to transfer the data to S3 successfully with the earlier-mentioned workaround. I was also able to max out our available upload bandwidth, averaging around 10 MB/s from the cluster.

I ran the s3distcp jobs with the default timeout and did not face any issues.

Thanks all for the help.

Himanish

On Sat, Mar 30, 2013 at 9:26 PM, David Parks <davidparks21@yahoo.com> wrote:

> 4-20 MB/sec are common transfer rates from S3 to *1* local AWS box. This
> was, of course, a cluster, and s3distcp is specifically designed to take
> advantage of the cluster, so it was a 45-minute job to transfer the 1.5 TB
> to the full cluster of, I forget how many servers I had at the time, maybe
> 15-30 m1.xlarge.
> The numbers are rough; I could be mistaken and it was 1 ½ hours to do the
> transfer (but I recall 45 min). In either case the s3distcp job ran longer
> than the task timeout period, which was the real point I was focusing on.
>
> I seem to recall needing to re-package their jar as well, but for different
> reasons: they package in some other open source utilities and I had version
> conflicts, so you might want to watch for that.
>
> I've never seen this ProgressableResettableBufferedFileInputStream, so I
> can't offer much more advice on that one.
>
> Good luck! Let us know how it turns out.
>
> Dave
>
> From: Himanish Kushary [mailto:himanish@gmail.com]
> Sent: Friday, March 29, 2013 9:57 PM
> To: user@hadoop.apache.org
> Subject: Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
>
> Yes, you are right, CDH4 is the 2.x line, but I even checked the javadocs
> for the 1.0.4 branch (could not find the 1.0.3 APIs, so used
> http://hadoop.apache.org/docs/r1.0.4/api/index.html) and did not find the
> "ProgressableResettableBufferedFileInputStream" class. Not sure how it is
> present in the hadoop-core.jar in Amazon EMR.
>
> In the meantime I have come up with a dirty workaround by extracting the
> class from the Amazon jar and packaging it into its own separate jar. I am
> actually able to run s3distcp now on local CDH4 using Amazon's jar and
> transfer from my local Hadoop to Amazon S3.
>
> But the real issue is the throughput. You mentioned that you transferred
> 1.5 TB in 45 minutes, which comes to around 583 MB/s. I am barely getting
> 4 MB/s upload speed! How did you get 100x the speed compared to me? Could
> you please share any settings/tweaks that you may have done to achieve
> this? Were you on some very specific high-bandwidth network? Was it
> between HDFS on EC2 and Amazon S3?
>
> Looking forward to hearing from you.
>
> Thanks
>
> Himanish
>
> On Fri, Mar 29, 2013 at 10:34 AM, David Parks <davidparks21@yahoo.com>
> wrote:
>
> CDH4 can be either 1.x or 2.x Hadoop; are you using the 2.x line? I've
> used it primarily with 1.0.3, which is what AWS uses, so I presume that's
> what it's tested on.
>
> Himanish Kushary <himanish@gmail.com> wrote:
>
> Thanks Dave.
>
> I had already tried using the s3distcp jar, but got stuck on the below
> error, which made me think that this is something specific to the Amazon
> Hadoop distribution.
>
> Exception in thread "Thread-28" java.lang.NoClassDefFoundError:
> org/apache/hadoop/fs/s3native/ProgressableResettableBufferedFileInputStream
>
> Also, I noticed that the Amazon EMR hadoop-core.jar has this class but it
> is not present in the CDH4 (my local env) Hadoop jars.
>
> Could you suggest how I could get around this issue? One option could be
> using the Amazon-specific jars, but then I would probably need to get all
> the jars (else it could cause version mismatch errors for HDFS -
> NoSuchMethodError, etc.).
>
> Appreciate your help regarding this.
>
> - Himanish
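A minimal sketch of the class-extraction workaround Himanish describes above
(pulling ProgressableResettableBufferedFileInputStream out of the EMR
hadoop-core jar into a small standalone "shim" jar). The jar file names are
hypothetical placeholders, and this is only an illustration of the idea, not
the exact steps used in the thread:

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.jar.JarEntry;
    import java.util.jar.JarFile;
    import java.util.jar.JarOutputStream;

    public class ExtractSingleClass {
        public static void main(String[] args) throws IOException {
            String entry = "org/apache/hadoop/fs/s3native/"
                    + "ProgressableResettableBufferedFileInputStream.class";
            // Source and target jar names are hypothetical placeholders.
            try (JarFile emrJar = new JarFile("hadoop-core-emr.jar");
                 JarOutputStream shim = new JarOutputStream(
                         new FileOutputStream("s3native-shim.jar"))) {
                JarEntry e = emrJar.getJarEntry(entry);
                if (e == null) {
                    throw new IOException("class not found in jar: " + entry);
                }
                // Copy the single compiled class, byte for byte, into the new jar.
                shim.putNextEntry(new JarEntry(entry));
                try (InputStream in = emrJar.getInputStream(e)) {
                    byte[] buf = new byte[8192];
                    for (int n; (n = in.read(buf)) != -1; ) {
                        shim.write(buf, 0, n);
                    }
                }
                shim.closeEntry();
            }
        }
    }

The resulting shim jar can then sit on the local CDH4 classpath (or be passed
via -libjars) next to the s3distcp jar so the missing class resolves at run
time.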
> On Fri, Mar 29, 2013 at 1:41 AM, David Parks <davidparks21@yahoo.com>
> wrote:
>
> None of that complexity; they distribute the jar publicly (not the source,
> but the jar). You can just add this to your libjars:
> s3n://*region*.elasticmapreduce/libs/s3distcp/*latest*/s3distcp.jar
>
> No VPN or anything; if you can access the internet you can get to S3.
>
> Follow their docs here:
> http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
>
> It doesn't matter where your Hadoop instance is running.
>
> Here's an example of the code/parameters I used to run it from within
> another Tool. It's a Tool itself, so it's actually designed to run from
> the Hadoop command line normally.
>
>     ToolRunner.run(getConf(), new S3DistCp(), new String[] {
>             "--src",        "/frugg/image-cache-stage2/",
>             "--srcPattern", ".*part.*",
>             "--dest",       "s3n://fruggmapreduce/results-" + env + "/" + JobUtils.isoDate + "/output/itemtable/",
>             "--s3Endpoint", "s3.amazonaws.com" });
>
> Watch the "srcPattern": make sure you have that leading `.*`, that one
> threw me for a loop once.
>
> Dave
>
> From: Himanish Kushary [mailto:himanish@gmail.com]
> Sent: Thursday, March 28, 2013 5:51 PM
> To: user@hadoop.apache.org
> Subject: Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
>
> Hi Dave,
>
> Thanks for your reply. Our Hadoop instance is inside our corporate LAN.
> Could you please provide some details on how I could use s3distcp from
> Amazon to transfer data from our on-premises Hadoop to Amazon S3? Wouldn't
> some kind of VPN be needed between the Amazon EMR instance and our
> on-premises Hadoop instance? Did you mean use the jar from Amazon on our
> local server?
>
> Thanks
>
> On Thu, Mar 28, 2013 at 3:56 AM, David Parks <davidparks21@yahoo.com>
> wrote:
>
> Have you tried using s3distcp from Amazon? I used it many times to
> transfer 1.5 TB between S3 and Hadoop instances. The process took 45 min,
> well over the 10-minute timeout period you're running into a problem with.
>
> Dave
>
> From: Himanish Kushary [mailto:himanish@gmail.com]
> Sent: Thursday, March 28, 2013 10:54 AM
> To: user@hadoop.apache.org
> Subject: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
>
> Hello,
>
> I am trying to transfer around 70 GB of files from HDFS to Amazon S3 using
> the distcp utility. There are around 2200 files distributed over 15
> directories. The max individual file size is approx 50 MB.
>
> The distcp mapreduce job keeps on failing with this error:
>
> "Task attempt_201303211242_0260_m_000005_0 failed to report status for 600
> seconds. Killing!"
>
> and in the task attempt logs I can see a lot of INFO messages like:
>
> "INFO org.apache.commons.httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark"
>
> As a workaround I am thinking of either transferring individual folders
> instead of the entire 70 GB, or increasing the "mapred.task.timeout"
> parameter to something like 6-7 hours (as the avg rate of transfer to S3
> seems to be 5 MB/s). Is there any better option to increase the throughput
> for transferring bulk data from HDFS to S3? Looking forward to suggestions.
>
> --
> Thanks & Regards
> Himanish
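Tying the thread's two suggestions together, here is a minimal,
self-contained sketch of driving S3DistCp from a local cluster with
"mapred.task.timeout" raised to roughly the 6-7 hour value Himanish
considers above. The S3DistCp package name and the source/destination paths
are assumptions for illustration (the thread never shows the package name),
so adjust them to match the s3distcp jar David points to above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.ToolRunner;
    // Assumed package name; check the s3distcp jar you actually downloaded.
    import com.amazon.external.elasticmapreduce.s3distcp.S3DistCp;

    public class S3DistCpWithLongerTimeout {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Raise the task timeout well past the default 600 seconds so slow
            // S3 uploads are not killed for failing to report status.
            // The value is in milliseconds; 6 hours here, per the thread.
            conf.setLong("mapred.task.timeout", 6L * 60 * 60 * 1000);
            int rc = ToolRunner.run(conf, new S3DistCp(), new String[] {
                    "--src",        "/data/to-upload/",        // hypothetical HDFS path
                    "--srcPattern", ".*part.*",                // note the leading .*
                    "--dest",       "s3n://my-bucket/output/", // hypothetical bucket
                    "--s3Endpoint", "s3.amazonaws.com" });
            System.exit(rc);
        }
    }

Whether the timeout override is picked up depends on S3DistCp building its
job from the Configuration passed to ToolRunner, so treat this as a starting
point rather than a guaranteed fix; the per-folder transfers Himanish
mentions remain a reasonable fallback.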
--
Thanks & Regards
Himanish