flume-user mailing list archives

From Kevin Warner <kevinwarner7...@gmail.com>
Subject Re: Newbie - Sink question
Date Sun, 07 Sep 2014 17:37:41 GMT
I am getting pushback from one of the engineers who works for a friend of
mine. Could you please take a look at his reply below and let me know what
you think:

I really have nothing against Morphline, as it seems to be driving Apache
Flume in the right direction, but I still stand by my point that Morphline
at its current stage of maturity can't be used in our case.

I don't know if you have noticed that a Flume Interceptor runs as part of
the Flume Source process, according to this passage from the limited
Morphline Interceptor documentation:
"Currently, there is a restriction in that the morphline of an interceptor
must not generate more than one output record for each input event. This
interceptor is not intended for heavy duty ETL processing - if you need
this consider moving ETL processing from the Flume Source to a Flume Sink,
e.g. to a MorphlineSolrSink."
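
As a rough illustration of the one-record-per-event restriction quoted
above, here is a minimal plain-Java sketch of a per-event JSON-to-CSV step
of the kind discussed in this thread. It uses no Flume or Morphline classes.
The class name, the "id" and "value" field names, and the in-memory lookup
map (a stand-in for the MySQL lookup) are all hypothetical, and the event is
assumed to be already parsed into a map. In a real agent, logic of this
shape would sit behind Flume's Interceptor.intercept(Event) method or inside
a custom sink.

```java
import java.util.Map;

// Hypothetical sketch: one already-parsed event in, exactly one CSV line out.
final class EventToCsvSketch {
    // Stand-in for the MySQL lookup; a real agent would query or cache the database.
    private final Map<String, String> lookup;

    EventToCsvSketch(Map<String, String> lookup) {
        this.lookup = lookup;
    }

    // Produces a single CSV line per input event, so the transformation
    // never emits more than one output record for each input event.
    String toCsvLine(Map<String, String> event) {
        String id = event.getOrDefault("id", "");        // assumed field name
        String value = event.getOrDefault("value", "");  // assumed field name
        String name = lookup.getOrDefault(id, "");       // empty when the id is unknown
        return String.join(",", quote(id), quote(name), quote(value));
    }

    // Quotes a field only when it contains a comma, a double quote, or a newline.
    static String quote(String field) {
        if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }
}
```

Because each call returns exactly one line, a step like this would not run
into the one-output-per-input restriction by itself. Whether the database
lookup is cheap enough to run on the source side is a separate question,
and that is the engineer's concern below.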

Given that, they obviously intended the Flume Sink to be the heavy lifter,
as an implementation in the interceptor will slow the Flume Source down.
Also, there is only one Flume Sink implementation of Morphline, and it is
intended to pass data to Solr (see this

Of course, we could create our own Morphline Sink, as there is some
documentation on using the Morphline libraries from Java code.

Please advise.


On Thu, Sep 4, 2014 at 11:08 PM, Ashish <paliwalashish@gmail.com> wrote:

> I would recommend using an Interceptor for this and possibly a modified
> Flume topology. If the json files have large numbers of rows or very high
> number of files, go for a Collection tier, and use another level of agents
> that uses interceptors for DB lookup and CSV generation. Something like
> Collection Agents -> Transformation Agents (writing to S3 Sinks)
> You can scale out Transformation/Collection layer agents based on the
> traffic volume
> thanks
> On Fri, Sep 5, 2014 at 8:23 AM, Kevin Warner <kevinwarner7965@gmail.com>
> wrote:
>> Hello All,
>> We have the following configuration:
>> Source->Channel->Sink
>> Now, the source is pointing to a folder that has lots of json files. The
>> channel is file based so that there is fault tolerance and the Sink is
>> putting CSV files on S3.
>> Now, there is code written in Sink that takes the JSON events and does
>> some MySQL database lookup and generates CSV files to be put into S3.
>> The question is: is the Sink the right place for this code, or should the
>> code be running in the Channel, since the ACID guarantees are present in
>> the Channel? Please advise.
>> -Kev
> --
> thanks
> ashish
> Blog: http://www.ashishpaliwal.com/blog
> My Photo Galleries: http://www.pbase.com/ashishpaliwal
