flume-user mailing list archives

From Sharninder <sharnin...@gmail.com>
Subject Re: Newbie - Sink question
Date Fri, 05 Sep 2014 05:01:06 GMT
Yes, the sink seems like the right place to put the CSV-to-S3 code. Don't mess
with the channel code unless you know what you're doing. That said, since
you're doing db lookups in the sink, I'd imagine they would slow down how fast
the channel drains, depending on the source data rate. What I'd suggest is that
you take a look at how interceptors work and/or maybe take a look at the
morphline sdk (
http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/
).

Keep the source for only reading files and the sink for only writing files.
Everything else belongs in the interceptor/morphline.
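
As a rough sketch of that split, an agent config along these lines would move
the lookup out of the sink. The agent and component names (agent, src, ch,
snk), the paths, and the two com.example classes are placeholders made up for
illustration, not anything from your setup; the interceptor and sink classes
would be your own code. I'm also assuming a spooling directory source, since
you said the source points at a folder of files.

```properties
# Hypothetical names throughout: agent, src, ch, snk, paths, com.example.*
agent.sources = src
agent.channels = ch
agent.sinks = snk

# Source only reads the JSON files (assumed: spooling directory source)
agent.sources.src.type = spooldir
agent.sources.src.spoolDir = /path/to/json/files
agent.sources.src.channels = ch

# Interceptor does the MySQL lookup / enrichment (your own class)
agent.sources.src.interceptors = enrich
agent.sources.src.interceptors.enrich.type = com.example.DbLookupInterceptor$Builder

# File channel for fault tolerance, left untouched
agent.channels.ch.type = file
agent.channels.ch.checkpointDir = /path/to/checkpoint
agent.channels.ch.dataDirs = /path/to/data

# Sink only writes the CSV files to S3 (your own class)
agent.sinks.snk.type = com.example.CsvS3Sink
agent.sinks.snk.channel = ch
```

Note that interceptors run on the source side, before events are written to
the channel, so the lookup cost moves there rather than disappearing.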

--
Sharninder



On Fri, Sep 5, 2014 at 8:23 AM, Kevin Warner <kevinwarner7965@gmail.com>
wrote:

> Hello All,
> We have the following configuration:
> Source->Channel->Sink
>
> Now, the source is pointing to a folder that has lots of json files. The
> channel is file based so that there is fault tolerance and the Sink is
> putting CSV files on S3.
>
> Now, there is code written in Sink that takes the JSON events and does
> some MySQL database lookup and generates CSV files to be put into S3.
>
> The question is, is it the right place for the code or should the code be
> running in the channel, as the ACID guarantees are present in the channel? Please
> advise.
>
> -Kev
>
>
