Subject: Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
From: Himanish Kushary <himanish@gmail.com>
To: user@hadoop.apache.org
Date: Fri, 29 Mar 2013 10:57:12 -0400

Yes, you are right that CDH4 is on the 2.x line, but I also checked the javadocs for the 1.0.4 branch (could not find the 1.0.3 APIs, so I used http://hadoop.apache.org/docs/r1.0.4/api/index.html) and still did not find the "ProgressableResettableBufferedFileInputStream" class. Not sure how it is present in the hadoop-core.jar in Amazon EMR.

In the meantime I have come up with a dirty workaround: extracting the class from the Amazon jar and packaging it into its own separate jar. I am actually able to run s3distcp now on local CDH4 using Amazon's jar and transfer from my local Hadoop to Amazon S3.
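In case it helps anyone else trying the same workaround, here is the small sanity check I run first to confirm the repackaged class is actually visible on the classpath (just a sketch; the class name is the one from the NoClassDefFoundError quoted below):

    // Quick check that the class extracted from Amazon EMR's hadoop-core.jar
    // (and repackaged into its own jar) is visible before launching s3distcp.
    public class ClasspathCheck {
        public static void main(String[] args) {
            try {
                Class.forName(
                    "org.apache.hadoop.fs.s3native.ProgressableResettableBufferedFileInputStream");
                System.out.println("Class found - OK to launch s3distcp on CDH4");
            } catch (ClassNotFoundException e) {
                System.err.println("Class missing - add the repackaged jar to the classpath");
            }
        }
    }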
But the real issue is the throughput. You mentioned that you had transferred 1.5 TB in 45 minutes, which comes to around 583 MB/s. I am barely getting 4 MB/s upload speed!! How did you get roughly 100x the speed compared to me? Could you please share any settings/tweaks that you may have done to achieve this. Were you on some very specific high-bandwidth network? Was it between HDFS on EC2 and Amazon S3?

Looking forward to hearing from you.

Thanks
Himanish
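P.S. For reference, this is roughly how I am kicking off the transfer from our side once the repackaged jar is on the classpath. It is only a sketch: the paths, bucket and credentials are placeholders, and the S3DistCp import is a guess that depends on the version of Amazon's s3distcp.jar you grab.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.ToolRunner;
    // NOTE: S3DistCp comes from Amazon's s3distcp.jar; the package below is an
    // assumption - check the jar you downloaded for the actual package name.
    import com.amazon.external.elasticmapreduce.s3distcp.S3DistCp;

    public class LocalS3DistCpDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Raise the MR task timeout so slow uploads to S3 are not killed
            // after the default 600 seconds (here: 6 hours, in milliseconds).
            conf.setLong("mapred.task.timeout", 6L * 60 * 60 * 1000);
            // s3n:// credentials for the destination bucket (placeholders).
            conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
            conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

            int rc = ToolRunner.run(conf, new S3DistCp(), new String[] {
                    "--src", "/placeholder/hdfs/folder/",
                    "--dest", "s3n://placeholder-bucket/target/",
                    "--s3Endpoint", "s3.amazonaws.com" });
            System.exit(rc);
        }
    }

(Since it is a Tool, the same timeout can presumably also be passed on the command line as -D mapred.task.timeout=21600000.)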
On Fri, Mar 29, 2013 at 10:34 AM, David Parks <davidparks21@yahoo.com> wrote:

> CDH4 can be either 1.x or 2.x Hadoop; are you using the 2.x line? I've used
> it primarily with 1.0.3, which is what AWS uses, so I presume that's what
> it's tested on.
>
> Himanish Kushary <himanish@gmail.com> wrote:
>
> Thanks Dave.
>
> I had already tried using the s3distcp jar, but got stuck on the below
> error, which made me think that this is something specific to the Amazon
> Hadoop distribution:
>
> Exception in thread "Thread-28" java.lang.NoClassDefFoundError:
> org/apache/hadoop/fs/s3native/ProgressableResettableBufferedFileInputStream
>
> Also, I noticed that the Amazon EMR hadoop-core.jar has this class but it
> is not present in the CDH4 (my local env) Hadoop jars.
>
> Could you suggest how I could get around this issue? One option could be
> using the Amazon-specific jars, but then I would probably need to get all
> the jars (else it could cause version mismatch errors for HDFS -
> NoSuchMethodError etc.).
>
> Appreciate your help regarding this.
>
> - Himanish
>
> On Fri, Mar 29, 2013 at 1:41 AM, David Parks <davidparks21@yahoo.com> wrote:
>
>> None of that complexity; they distribute the jar publicly (not the
>> source, but the jar). You can just add this to your libjars:
>> s3n://*region*.elasticmapreduce/libs/s3distcp/*latest*/s3distcp.jar
>>
>> No VPN or anything; if you can access the internet you can get to S3.
>>
>> Follow their docs here:
>> http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
>>
>> Doesn't matter where your Hadoop instance is running.
>>
>> Here's an example of the code/parameters I used to run it from within
>> another Tool. It's a Tool, so it's actually designed to run from the
>> Hadoop command line normally.
>>
>>     ToolRunner.run(getConf(), new S3DistCp(), new String[] {
>>         "--src",        "/frugg/image-cache-stage2/",
>>         "--srcPattern", ".*part.*",
>>         "--dest",       "s3n://fruggmapreduce/results-" + env + "/" + JobUtils.isoDate + "/output/itemtable/",
>>         "--s3Endpoint", "s3.amazonaws.com" });
>>
>> Watch the "srcPattern"; make sure you have that leading `.*`, that one
>> threw me for a loop once.
>>
>> Dave
>>
>> From: Himanish Kushary [mailto:himanish@gmail.com]
>> Sent: Thursday, March 28, 2013 5:51 PM
>> To: user@hadoop.apache.org
>> Subject: Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
>>
>> Hi Dave,
>>
>> Thanks for your reply. Our Hadoop instance is inside our corporate LAN.
>> Could you please provide some details on how I could use the s3distcp
>> from Amazon to transfer data from our on-premises Hadoop to Amazon S3?
>> Wouldn't some kind of VPN be needed between the Amazon EMR instance and
>> our on-premises Hadoop instance? Did you mean use the jar from Amazon on
>> our local server?
>>
>> Thanks
>>
>> On Thu, Mar 28, 2013 at 3:56 AM, David Parks <davidparks21@yahoo.com> wrote:
>>
>> Have you tried using s3distcp from Amazon? I used it many times to
>> transfer 1.5 TB between S3 and Hadoop instances. The process took 45 min,
>> well over the 10 min timeout period you're running into a problem on.
>>
>> Dave
>>
>> From: Himanish Kushary [mailto:himanish@gmail.com]
>> Sent: Thursday, March 28, 2013 10:54 AM
>> To: user@hadoop.apache.org
>> Subject: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
>>
>> Hello,
>>
>> I am trying to transfer around 70 GB of files from HDFS to Amazon S3
>> using the distcp utility. There are around 2200 files distributed over 15
>> directories. The max individual file size is approx 50 MB.
>>
>> The distcp mapreduce job keeps on failing with this error:
>>
>> "Task attempt_201303211242_0260_m_000005_0 failed to report status for
>> 600 seconds. Killing!"
>>
>> and in the task attempt logs I can see lots of INFO messages like:
>>
>> "INFO org.apache.commons.httpclient.HttpMethodDirector: I/O exception
>> (java.io.IOException) caught when processing request: Resetting to invalid
>> mark"
>>
>> I am thinking of either transferring individual folders instead of the
>> entire 70 GB folder as a workaround, or alternatively increasing the
>> "mapred.task.timeout" parameter to something like 6-7 hours (as the avg
>> rate of transfer to S3 seems to be 5 MB/s). Is there any other, better
>> option to increase the throughput for transferring bulk data from HDFS to
>> S3? Looking forward to suggestions.
>>
>> --
>> Thanks & Regards
>> Himanish
>>
>> --
>> Thanks & Regards
>> Himanish
>
> --
> Thanks & Regards
> Himanish

--
Thanks & Regards
Himanish