Fair enough. I just wanted to point out that doing it via a script is going to be far faster to implement than something like Flume (and arguably more reliable too, with no maintenance overhead). Don't get me wrong, we use Flume for our data collection as well, but our use case is real-time/online data collection and Flume does that job well. So nothing against Flume per se. I was just thinking: if a script becomes a pain down the road, how much throwaway effort are we talking about, a few minutes to a few hours at most, versus a few days to a few weeks of throwaway work if Flume becomes a pain.
The closest built-in functionality to my use case is the "Spooling Directory Source", and I like the idea of building software in higher-level languages like Java for reasons of extensibility etc. (and I don't like the idea of scripts).
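For reference, a minimal spooldir-to-HDFS agent config might look roughly like this; the agent, channel, and directory names here are illustrative, not from this thread:

```properties
# Sketch of a Flume agent using the Spooling Directory Source.
# Paths and names (agent1, spool1, ch1, hdfs1) are assumptions.
agent1.sources  = spool1
agent1.channels = ch1
agent1.sinks    = hdfs1

# Watch a local directory; Flume ingests each completed file dropped here.
agent1.sources.spool1.type     = spooldir
agent1.sources.spool1.spoolDir = /var/log/incoming
agent1.sources.spool1.channels = ch1

# Durable file channel between source and sink.
agent1.channels.ch1.type = file

# Write events out to HDFS.
agent1.sinks.hdfs1.type      = hdfs
agent1.sinks.hdfs1.hdfs.path = hdfs://namenode:8020/data/logs
agent1.sinks.hdfs1.channel   = ch1
```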
However, I am soliciting opinions and can be swayed to change my mind.
Thanks for your response Dhaval - appreciate it.
On Wed, Jan 22, 2014 at 2:58 PM, Dhaval Shah <email@example.com> wrote:
Flume is useful for online log aggregation in a streaming format. Your use case seems more like a batch format, where you just need to grab the file and put it in HDFS at regular intervals, which can be achieved much more easily by a bash script running on a cron'd basis.
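For what it's worth, the cron'd-script approach can be sketched in a few lines. The directory names and the rename-after-upload convention below are assumptions for illustration, not anything from the thread:

```shell
#!/usr/bin/env bash
# Sketch of a batch-ingest script to run from cron.
# Directory names are hypothetical; adjust to taste.
set -euo pipefail

ingest() {
    local src_dir=$1 hdfs_dir=$2 f
    for f in "$src_dir"/*.log; do
        [ -e "$f" ] || continue            # glob matched nothing
        # Upload to HDFS, then rename locally so the next run skips it.
        hdfs dfs -put "$f" "$hdfs_dir"/ && mv "$f" "$f.done"
    done
}

# Example crontab entry (every 15 minutes):
#   */15 * * * * /usr/local/bin/ingest_logs.sh
ingest "${1:-/var/log/myapp}" "${2:-/data/myapp/incoming}"
```

One design note: renaming the file after a successful put is the simplest way to make the script idempotent across cron runs; if the upload fails, the file keeps its original name and is retried next time.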