Subject: Re: Managed File Transfer
From: Mohan Radhakrishnan <radhakrishnan.mohan@gmail.com>
To: user@hadoop.apache.org
Date: Wed, 9 Jul 2014 21:41:28 +0530

I am a beginner, but this seems to be similar to what I intend. The data
source will be external FTP or S3 storage.

"Spark Streaming can read data from HDFS, Flume, Kafka, Twitter and
ZeroMQ. You can also define your own custom data sources."

Thanks,
Mohan

On Wed, Jul 9, 2014 at 2:09 PM, Stanley Shi <sshi@gopivotal.com> wrote:
> There's a DistCp utility for this kind of purpose;
> there's also "Spring XD", but I am not sure if you want to use it.
>
> Regards,
> Stanley Shi
>
> On Mon, Jul 7, 2014 at 10:02 PM, Mohan Radhakrishnan
> <radhakrishnan.mohan@gmail.com> wrote:
>> Hi,
>> We used a commercial file-transfer and scheduler tool in clustered
>> mode. This was a traditional active-active cluster that supported
>> multiple protocols such as FTPS.
>>
>> Now I am interested in evaluating a distributed way of crawling FTP
>> sites and downloading files using Hadoop. Since we have to process
>> thousands of files, I thought Hadoop jobs could do it.
>>
>> Are Hadoop jobs used for this type of file transfer?
>>
>> Moreover, there is also a requirement for a scheduler. What is the
>> recommendation of the forum?
>>
>> Thanks,
>> Mohan
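For the "thousands of files" case discussed above, the core idea behind tools like DistCp is to take a list of source paths and divide the copy work across many map tasks. A rough, Hadoop-free sketch of that splitting step (all names here are illustrative, not Hadoop or DistCp API):

```python
# Sketch: spread a list of remote file paths across N worker tasks
# round-robin, so each task downloads roughly the same number of files.
# This only models the work-partitioning idea; it does no actual I/O.

def partition(paths, num_tasks):
    """Assign each path to one of num_tasks chunks, round-robin."""
    chunks = [[] for _ in range(num_tasks)]
    for i, path in enumerate(paths):
        chunks[i % num_tasks].append(path)
    return chunks

if __name__ == "__main__":
    paths = ["ftp://host/file%d.csv" % i for i in range(5)]
    for task_id, chunk in enumerate(partition(paths, 2)):
        print(task_id, chunk)
```

In a real job, each chunk would become the input split of one map task, which fetches its files and writes them to HDFS; a scheduler (e.g. Oozie, which the list often recommends for recurring Hadoop jobs) would trigger the crawl periodically.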