From: Himanish Kushary <himanish@gmail.com>
To: user@hadoop.apache.org
Date: Mon, 1 Apr 2013 08:53:00 -0400
Subject: Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

I was able to transfer the data to S3 successfully with the earlier-mentioned workaround. I was also able to max out our available upload bandwidth, averaging around 10 MB/s from the cluster.

I ran the s3distcp jobs with the default timeout and did not face any issues.

Thanks all for the help.

Himanish

On Sat, Mar 30, 2013 at 9:26 PM, David Parks <davidparks21@yahoo.com> wrote:

> 4-20 MB/sec are common transfer rates from S3 to *1* local AWS box. This
> was, of course, a cluster, and s3distcp is specifically designed to take
> advantage of the cluster, so it was a 45-minute job to transfer the 1.5 TB
> to the full cluster of, I forget how many servers I had at the time, maybe
> 15-30 m1.xlarge.
> The numbers are rough; I could be mistaken and it was 1 ½ hours to do the
> transfer (but I recall 45 min). In either case the s3distcp job ran longer
> than the task timeout period, which was the real point I was focusing on.
>
> I seem to recall needing to re-package their jar as well, but for different
> reasons: they package in some other open source utilities and I had version
> conflicts, so you might want to watch for that.
>
> I've never seen this ProgressableResettableBufferedFileInputStream, so I
> can't offer much more advice on that one.
>
> Good luck! Let us know how it turns out.
>
> Dave
>
> From: Himanish Kushary [mailto:himanish@gmail.com]
> Sent: Friday, March 29, 2013 9:57 PM
> To: user@hadoop.apache.org
> Subject: Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
>
> Yes, you are right, CDH4 is the 2.x line, but I even checked the javadocs
> for the 1.0.4 branch (could not find the 1.0.3 APIs, so used
> http://hadoop.apache.org/docs/r1.0.4/api/index.html) and did not find the
> "ProgressableResettableBufferedFileInputStream" class. Not sure how it is
> present in the hadoop-core.jar in Amazon EMR.
>
> In the meantime I have come up with a dirty workaround by extracting the
> class from the Amazon jar and packaging it into its own separate jar. I am
> actually able to run s3distcp now on local CDH4 using Amazon's jar and
> transfer from my local Hadoop to Amazon S3.
>
> But the real issue is the throughput. You mentioned that you transferred
> 1.5 TB in 45 minutes, which comes to around 583 MB/s. I am barely getting
> 4 MB/s upload speed! How did you get 100x the speed compared to me? Could
> you please share any settings/tweaks that you may have done to achieve
> this? Were you on some very specific high-bandwidth network? Was it
> between HDFS on EC2 and Amazon S3?
>
> Looking forward to hearing from you.
>
> Thanks
>
> Himanish
>
> On Fri, Mar 29, 2013 at 10:34 AM, David Parks <davidparks21@yahoo.com>
> wrote:
>
> CDH4 can be either 1.x or 2.x Hadoop; are you using the 2.x line? I've
> used it primarily with 1.0.3, which is what AWS uses, so I presume that's
> what it's tested on.
>
> Himanish Kushary <himanish@gmail.com> wrote:
>
> Thanks Dave.
>
> I had already tried using the s3distcp jar, but got stuck on the below
> error, which made me think that this is something specific to the Amazon
> Hadoop distribution.
>
> Exception in thread "Thread-28" java.lang.NoClassDefFoundError:
> org/apache/hadoop/fs/s3native/ProgressableResettableBufferedFileInputStream
>
> Also, I noticed that the Amazon EMR hadoop-core.jar has this class but it
> is not present in the CDH4 (my local env) Hadoop jars.
>
> Could you suggest how I could get around this issue? One option could be
> using the Amazon-specific jars, but then I would probably need to get all
> the jars (else it could cause version mismatch errors for HDFS -
> NoSuchMethodError, etc.).
>
> Appreciate your help regarding this.
>
> - Himanish
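A minimal sketch of the class-extraction workaround Himanish describes above
(pulling ProgressableResettableBufferedFileInputStream out of the EMR
hadoop-core jar into a small standalone "shim" jar). The jar file names are
hypothetical placeholders, and this is only an illustration of the idea, not
the exact steps used in the thread:

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.jar.JarEntry;
    import java.util.jar.JarFile;
    import java.util.jar.JarOutputStream;

    public class ExtractSingleClass {
        public static void main(String[] args) throws IOException {
            String entry = "org/apache/hadoop/fs/s3native/"
                    + "ProgressableResettableBufferedFileInputStream.class";
            // Source and target jar names are hypothetical placeholders.
            try (JarFile emrJar = new JarFile("hadoop-core-emr.jar");
                 JarOutputStream shim = new JarOutputStream(
                         new FileOutputStream("s3native-shim.jar"))) {
                JarEntry e = emrJar.getJarEntry(entry);
                if (e == null) {
                    throw new IOException("class not found in jar: " + entry);
                }
                // Copy the single compiled class, byte for byte, into the new jar.
                shim.putNextEntry(new JarEntry(entry));
                try (InputStream in = emrJar.getInputStream(e)) {
                    byte[] buf = new byte[8192];
                    for (int n; (n = in.read(buf)) != -1; ) {
                        shim.write(buf, 0, n);
                    }
                }
                shim.closeEntry();
            }
        }
    }

The resulting shim jar can then sit on the local CDH4 classpath (or be passed
via -libjars) next to the s3distcp jar so the missing class resolves at run
time.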
> On Fri, Mar 29, 2013 at 1:41 AM, David Parks <davidparks21@yahoo.com>
> wrote:
>
> None of that complexity; they distribute the jar publicly (not the source,
> but the jar). You can just add this to your libjars:
> s3n://*region*.elasticmapreduce/libs/s3distcp/*latest*/s3distcp.jar
>
> No VPN or anything; if you can access the internet you can get to S3.
>
> Follow their docs here:
> http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
>
> It doesn't matter where your Hadoop instance is running.
>
> Here's an example of the code/parameters I used to run it from within
> another Tool. It's a Tool itself, so it's actually designed to run from
> the Hadoop command line normally.
>
>     ToolRunner.run(getConf(), new S3DistCp(), new String[] {
>             "--src",        "/frugg/image-cache-stage2/",
>             "--srcPattern", ".*part.*",
>             "--dest",       "s3n://fruggmapreduce/results-" + env + "/" + JobUtils.isoDate + "/output/itemtable/",
>             "--s3Endpoint", "s3.amazonaws.com" });
>
> Watch the "srcPattern": make sure you have that leading `.*`, that one
> threw me for a loop once.
>
> Dave
>
> From: Himanish Kushary [mailto:himanish@gmail.com]
> Sent: Thursday, March 28, 2013 5:51 PM
> To: user@hadoop.apache.org
> Subject: Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
>
> Hi Dave,
>
> Thanks for your reply. Our Hadoop instance is inside our corporate LAN.
> Could you please provide some details on how I could use s3distcp from
> Amazon to transfer data from our on-premises Hadoop to Amazon S3? Wouldn't
> some kind of VPN be needed between the Amazon EMR instance and our
> on-premises Hadoop instance? Did you mean use the jar from Amazon on our
> local server?
>
> Thanks
>
> On Thu, Mar 28, 2013 at 3:56 AM, David Parks <davidparks21@yahoo.com>
> wrote:
>
> Have you tried using s3distcp from Amazon? I used it many times to
> transfer 1.5 TB between S3 and Hadoop instances. The process took 45 min,
> well over the 10-minute timeout period you're running into a problem with.
>
> Dave
>
> From: Himanish Kushary [mailto:himanish@gmail.com]
> Sent: Thursday, March 28, 2013 10:54 AM
> To: user@hadoop.apache.org
> Subject: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
>
> Hello,
>
> I am trying to transfer around 70 GB of files from HDFS to Amazon S3 using
> the distcp utility. There are around 2200 files distributed over 15
> directories. The max individual file size is approx 50 MB.
>
> The distcp mapreduce job keeps on failing with this error:
>
> "Task attempt_201303211242_0260_m_000005_0 failed to report status for 600
> seconds. Killing!"
>
> and in the task attempt logs I can see a lot of INFO messages like:
>
> "INFO org.apache.commons.httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark"
>
> As a workaround I am thinking of either transferring individual folders
> instead of the entire 70 GB, or increasing the "mapred.task.timeout"
> parameter to something like 6-7 hours (as the avg rate of transfer to S3
> seems to be 5 MB/s). Is there any better option to increase the throughput
> for transferring bulk data from HDFS to S3? Looking forward to suggestions.
>
> --
> Thanks & Regards
> Himanish
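Tying the thread's two suggestions together, here is a minimal,
self-contained sketch of driving S3DistCp from a local cluster with
"mapred.task.timeout" raised to roughly the 6-7 hour value Himanish
considers above. The S3DistCp package name and the source/destination paths
are assumptions for illustration (the thread never shows the package name),
so adjust them to match the s3distcp jar David points to above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.ToolRunner;
    // Assumed package name; check the s3distcp jar you actually downloaded.
    import com.amazon.external.elasticmapreduce.s3distcp.S3DistCp;

    public class S3DistCpWithLongerTimeout {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Raise the task timeout well past the default 600 seconds so slow
            // S3 uploads are not killed for failing to report status.
            // The value is in milliseconds; 6 hours here, per the thread.
            conf.setLong("mapred.task.timeout", 6L * 60 * 60 * 1000);
            int rc = ToolRunner.run(conf, new S3DistCp(), new String[] {
                    "--src",        "/data/to-upload/",        // hypothetical HDFS path
                    "--srcPattern", ".*part.*",                // note the leading .*
                    "--dest",       "s3n://my-bucket/output/", // hypothetical bucket
                    "--s3Endpoint", "s3.amazonaws.com" });
            System.exit(rc);
        }
    }

Whether the timeout override is picked up depends on S3DistCp building its
job from the Configuration passed to ToolRunner, so treat this as a starting
point rather than a guaranteed fix; the per-folder transfers Himanish
mentions remain a reasonable fallback.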
--
Thanks & Regards
Himanish