Fair enough. I just wanted to point out that doing it via a script is going to be far faster to implement than something like Flume (and arguably more reliable too, with no maintenance overhead). Don't get me wrong, we use Flume for our data collection as well, but our use case is real-time/online data collection and Flume does that job well. So nothing against Flume per se. I was just thinking: if a script becomes a pain down the road, how much throwaway effort are we talking about? A few minutes to a few hours at most. If Flume becomes a pain, it is a few days to a few weeks of throwaway work.

From: Kaalu Singh <kaalusingh1234@gmail.com>;
To: <user@hadoop.apache.org>; Dhaval Shah <prince_mithibai@yahoo.co.in>;
Subject: Re: Question about Flume
Sent: Wed, Jan 22, 2014 11:20:52 PM

The closest built-in functionality to my use case is the "Spooling Directory Source", and I like the idea of building software in higher-level languages like Java for reasons of extensibility etc. (and don't like the idea of scripts).

However, I am soliciting opinions and can be swayed to change my mind.

Thanks for your response Dhaval - appreciate it.


On Wed, Jan 22, 2014 at 2:58 PM, Dhaval Shah <prince_mithibai@yahoo.co.in> wrote:
Flume is useful for online log aggregation in a streaming fashion. Your use case seems more like a batch job, where you just need to grab the file and put it in HDFS at regular intervals, which can be achieved much more easily with a bash script run from cron.
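
As a rough illustration of the scheduled-script approach, here is a minimal sketch, written in Python rather than bash. The host, paths, and function names are all made up for the example; none of them come from this thread. It assumes key-based (non-interactive) SFTP login and that the `sftp` and `hadoop` commands are on the PATH. It does not keep track of which remote files were already fetched, so a real version would need to handle that.

```python
import subprocess
from pathlib import Path

# Hypothetical values for illustration only.
REMOTE = "user@machine-x"          # SFTP login for machine X
REMOTE_GLOB = "/data/out/*.dat"    # files to fetch from machine X
LOCAL_DIR = Path("/tmp/landing")   # local staging directory
HDFS_DIR = "/ingest/raw"           # target directory in HDFS


def sftp_batch(remote_glob, local_dir):
    """Build the sftp batch command that downloads matching files."""
    return f"get {remote_glob} {local_dir}/\n"


def hdfs_put_command(local_file, hdfs_dir):
    """Build the command line that copies one local file into HDFS."""
    return ["hadoop", "fs", "-put", str(local_file), hdfs_dir]


def run_once():
    """Fetch files over SFTP, then push each one into HDFS."""
    LOCAL_DIR.mkdir(parents=True, exist_ok=True)
    # "-b -" makes sftp read its batch commands from standard input.
    subprocess.run(
        ["sftp", "-b", "-", REMOTE],
        input=sftp_batch(REMOTE_GLOB, LOCAL_DIR),
        text=True,
        check=True,
    )
    for path in sorted(LOCAL_DIR.iterdir()):
        if path.is_file():
            # check=True raises on failure, so the local copy is
            # only deleted after a successful put.
            subprocess.run(hdfs_put_command(path, HDFS_DIR), check=True)
            path.unlink()
```

To use something like this, you would add a final `run_once()` call and schedule the file from cron at whatever interval suits the data.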



From: Kaalu Singh <kaalusingh1234@gmail.com>
To: user@hadoop.apache.org
Sent: Wednesday, 22 January 2014 5:52 PM
Subject: Question about Flume


I have the following use case:

I have data files being generated frequently on a certain machine, X. The only way I can bring them into my Hadoop cluster is to fetch them over SFTP at regular intervals and land them in HDFS.

I am new to Hadoop and to Flume. I read up on Flume and it seems like an appropriate framework for something like this, although I did not see an available 'source' that does exactly what I am looking for. The lack of a 'source' plugin is not a deal breaker for me, as I can write one, but first I want to make sure this is the right way to go. So, my questions are:

1. What are the pros/cons of using Flume for this use case?
2. Does anybody know of a source plugin that does what I am looking for?
3. Does anybody think I should not use Flume and instead write my own application to achieve this use case?