From: Ted Dunning
Date: Thu, 28 Mar 2013 08:45:12 +0100
Subject: Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
To: "common-user@hadoop.apache.org"

The EMR distributions have special versions of the s3 file system. They might be helpful here. Of course, you likely aren't running those if you are seeing 5 MB/s.

An extreme alternative would be to light up an EMR cluster, copy to it, then to S3.

On Thu, Mar 28, 2013 at 4:54 AM, Himanish Kushary wrote:

> I am thinking of either transferring individual folders instead of the
> entire 70 GB folder as a workaround, or, as another option, increasing the
> "mapred.task.timeout" parameter to something like 6-7 hours (as the avg
> rate of transfer to S3 seems to be 5 MB/s). Is there any other better
> option to increase the throughput for transferring bulk data from HDFS to
> S3? Looking forward to suggestions.
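The timeout workaround from the quoted message can be sketched as a single distcp invocation. This is a sketch under assumptions, not a verified command: the namenode host, source path, bucket name, and credentials are placeholders, and `mapred.task.timeout` / the `s3n://` scheme are the MRv1-era names that applied to CDH4. The timeout value is in milliseconds:

```shell
# Hypothetical distcp run with a raised task timeout. At the observed
# ~5 MB/s, a 7-hour timeout gives long single-file uploads room to finish.
# 7 h = 7 * 3600 * 1000 ms = 25200000 ms
hadoop distcp \
  -D mapred.task.timeout=25200000 \
  -m 20 \
  hdfs://namenode:8020/data/bigfolder \
  s3n://ACCESS_KEY:SECRET_KEY@mybucket/bigfolder
```

The `-m 20` cap on simultaneous map tasks is also an assumption; limiting concurrent uploads can help avoid S3 request throttling, but the right number depends on cluster size and outbound bandwidth.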
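The other workaround in the quoted message, copying individual folders instead of the whole 70 GB tree, can be sketched as a loop over top-level subdirectories. Again a hedged sketch: the `SRC`/`DST` values are placeholders, and the `hadoop fs -ls` output parsing assumes the path is the last field of each listing line:

```shell
# Hypothetical per-folder copy: each subfolder runs as its own distcp job,
# so one slow transfer cannot time out the entire 70 GB copy.
SRC=hdfs://namenode:8020/data/bigfolder   # assumed source root
DST=s3n://mybucket/bigfolder              # assumed destination bucket/prefix

for dir in $(hadoop fs -ls "$SRC" | awk '{print $NF}' | grep "^$SRC/"); do
  hadoop distcp -D mapred.task.timeout=25200000 \
    "$dir" "$DST/$(basename "$dir")"
done
```

A side benefit of this approach is that a failed subfolder can be retried on its own instead of restarting the full transfer.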