hadoop-user mailing list archives

From Mohan Radhakrishnan <radhakrishnan.mo...@gmail.com>
Subject Re: Managed File Transfer
Date Wed, 09 Jul 2014 16:11:28 GMT
I am a beginner, but this seems similar to what I intend. The data
source will be an external FTP server or S3 storage.

"Spark Streaming can read data from HDFS,
Flume <http://flume.apache.org/>, Kafka <http://kafka.apache.org/>, Twitter
<https://dev.twitter.com/>, and ZeroMQ <http://zeromq.org/>. You can also
define your own custom data sources."
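As a rough sketch of the crawling side, a periodic FTP poller could track which files it has already fetched and download only the new ones before handing them to Hadoop or Spark. This is only an illustration using Python's standard ftplib; the function names `poll_ftp` and `select_new` are made up here and are not part of any tool mentioned in this thread.

```python
# Illustrative sketch only: poll an FTP directory and fetch files not seen
# before. Names (poll_ftp, select_new) are hypothetical, not from any tool
# discussed in this thread.
from ftplib import FTP

def select_new(listing, seen):
    """Return files in the FTP directory listing not downloaded yet, sorted."""
    return sorted(set(listing) - set(seen))

def poll_ftp(host, remote_dir, seen, user="anonymous", password=""):
    """Connect, list one directory, download only unseen files, update 'seen'."""
    ftp = FTP(host)
    ftp.login(user, password)
    ftp.cwd(remote_dir)
    for name in select_new(ftp.nlst(), seen):
        with open(name, "wb") as out:
            ftp.retrbinary("RETR " + name, out.write)
        seen.add(name)
    ftp.quit()
    return seen
```

In practice the `seen` set would have to live somewhere durable (e.g. a file or a small database) so that a scheduler can re-run the poller without re-downloading everything.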


On Wed, Jul 9, 2014 at 2:09 PM, Stanley Shi <sshi@gopivotal.com> wrote:

> There's the DistCp utility for this kind of purpose.
> There's also "Spring XD", but I am not sure whether you want to use it.
> Regards,
> *Stanley Shi,*
> On Mon, Jul 7, 2014 at 10:02 PM, Mohan Radhakrishnan <
> radhakrishnan.mohan@gmail.com> wrote:
>> Hi,
>>            We used a commercial FT and scheduler tool in clustered mode.
>> This was a traditional active-active cluster that supported multiple
>> protocols like FTPS etc.
>>     Now I am interested in evaluating a Distributed way of crawling FTP
>> sites and downloading files using Hadoop. I thought since we have to
>> process thousands of files Hadoop jobs can do it.
>> Are Hadoop jobs used for this type of file transfers ?
>> Moreover there is a requirement for a scheduler  also. What is the
>> recommendation of the forum ?
>> Thanks,
>> Mohan
