flume-user mailing list archives

From Chris Horrocks <ch...@hor.rocks>
Subject Re: Is it a good idea to use Flume Interceptor to process data?
Date Wed, 27 Jul 2016 12:27:37 GMT
Some rough initial thoughts:

This is interesting, but you might need to elaborate on how you've managed to attach an interceptor
to a channel (and why, in lieu of attaching it to the source):

we attach the Interceptor to the channel
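
For reference, a stock Flume agent declares the interceptor chain on the source in the
agent config. A minimal sketch (agent/component names, the Kafka details and the
interceptor class are all placeholders):

    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Kafka source (Flume 1.6-style settings; values are placeholders)
    a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
    a1.sources.r1.zookeeperConnect = zk1:2181
    a1.sources.r1.topic = metrics
    a1.sources.r1.channels = c1

    # Interceptors hang off the source definition, not the channel
    a1.sources.r1.interceptors = i1
    a1.sources.r1.interceptors.i1.type = com.example.MetricsInterceptor$Builder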

Personally I'd have done this by feeding the data into Spark Streaming and keeping Flume as
low-overhead as possible, particularly if it's monitoring data that's latency sensitive. For
storing the calculation variables for consumption by the interceptor I'd go with something
like ZooKeeper.
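
For illustration, watching a formula znode with Apache Curator is only a few lines; a rough
sketch (the quorum address, znode path and class name are made up):

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.cache.ChildData;
    import org.apache.curator.framework.recipes.cache.NodeCache;
    import org.apache.curator.framework.recipes.cache.NodeCacheListener;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class FormulaWatcher {
        public static void main(String[] args) throws Exception {
            // Hypothetical ZooKeeper quorum and znode path.
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk1:2181", new ExponentialBackoffRetry(1000, 3));
            client.start();

            // NodeCache keeps a local copy of the znode and fires a
            // callback on change, so the consumer never has to poll.
            final NodeCache cache = new NodeCache(client, "/flume/formula");
            cache.getListenable().addListener(new NodeCacheListener() {
                @Override
                public void nodeChanged() throws Exception {
                    ChildData data = cache.getCurrentData();
                    if (data != null) {
                        System.out.println("formula updated: "
                                + new String(data.getData(), "UTF-8"));
                    }
                }
            });
            cache.start(true); // true = build the initial cache synchronously
            // ... keep running; close cache and client on shutdown in real code
        }
    }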

--
Chris Horrocks

On Wed, Jul 27, 2016 at 12:39 pm, Thanh Hong Dai <hdthanh@tma.com.vn> wrote:

Hi,

To give some background: We are currently buffering monitoring data into Kafka, where each
message in Kafka records several metrics at a point in time.

For each record, we need to perform some calculations based on the metrics in the record,
append the results (multiple of them) to the record, and send the resulting record to a data
store (let’s call it DS1). All the data required for the calculation is encapsulated in the
record, essentially making this an embarrassingly parallel problem.

The formula for the calculation is stored in a different data store (let’s call it DS2),
and can be changed (added/deleted/modified) by the user. We are not required to react to a
change immediately, but we should do so within a reasonable time (e.g. 5 minutes).
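
For illustration, a timestamp-guarded re-read is one simple way to meet that bound without
any change notification; a rough sketch in Java (all names below are hypothetical):

    // Hypothetical helper: re-read the formula from DS2 at most once
    // per interval, so user changes are picked up within ~5 minutes.
    public class FormulaCache {
        private static final long REFRESH_MS = 5 * 60 * 1000L;
        private String formula;
        private long lastLoad = 0L;

        public synchronized String get() {
            long now = System.currentTimeMillis();
            if (formula == null || now - lastLoad >= REFRESH_MS) {
                formula = loadFromDs2();
                lastLoad = now;
            }
            return formula;
        }

        // Stub standing in for the real DS2 lookup.
        private String loadFromDs2() {
            return "metric_a + metric_b";
        }
    }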

Currently, we have prototyped an implementation of the data processing described above as
an Interceptor. We define the source as Kafka, the sink as the sink for DS1, and we attach
the Interceptor to the channel. As described above, the Interceptor regularly re-reads the
formula from DS2 to pick up any changes, and is responsible for processing the data as it
comes in from Kafka.
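
For concreteness, a skeleton of such an interceptor might look like the following. The
Interceptor and Builder shape is the standard Flume API; the class name, the DS2 lookup and
the calculation are placeholder stubs:

    import java.nio.charset.StandardCharsets;
    import java.util.List;

    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.interceptor.Interceptor;

    // Hypothetical skeleton: enrich each record with derived metrics.
    public class MetricsInterceptor implements Interceptor {

        private volatile String formula;

        @Override
        public void initialize() {
            // A periodic re-read (e.g. the FormulaCache idea above)
            // would keep this within the 5-minute staleness bound.
            formula = loadFormulaFromDs2();
        }

        @Override
        public Event intercept(Event event) {
            String record = new String(event.getBody(), StandardCharsets.UTF_8);
            // Append the calculated results to the original record.
            String enriched = record + "," + evaluate(formula, record);
            event.setBody(enriched.getBytes(StandardCharsets.UTF_8));
            return event;
        }

        @Override
        public List<Event> intercept(List<Event> events) {
            for (Event event : events) {
                intercept(event);
            }
            return events;
        }

        @Override
        public void close() {
        }

        // Stubs standing in for the real DS2 read and the calculation.
        private String loadFormulaFromDs2() { return "metric_a + metric_b"; }
        private String evaluate(String f, String record) { return "0"; }

        // Flume instantiates interceptors through a Builder named in the
        // agent config (see the earlier config sketch).
        public static class Builder implements Interceptor.Builder {
            @Override
            public Interceptor build() {
                return new MetricsInterceptor();
            }

            @Override
            public void configure(Context context) {
            }
        }
    }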

We are aware of other stream-processing frameworks such as Spark or Kafka. However, the
implementation above is motivated by the fact that Flume already provides reliable streaming,
and we want to reuse as much code as possible.

Is this usage of Flume a good idea in terms of performance and scalability?

Best regards,

Hong Dai Thanh.