flume-user mailing list archives

From Hari Shreedharan <hshreedha...@cloudera.com>
Subject Re: Architectural questions
Date Wed, 14 Aug 2013 21:23:54 GMT
Flume v1.3.0 had a major performance issue which is why 1.3.1 was released immediately after.
The current stable release is 1.4.0 - so you should use that. 

1. Can you detail this point? The channel-to-sink path should really not throw any exceptions - if the sink, or a plugin the sink is using, is causing rollbacks, then the sink should handle the failure cases, drop events, etc. The channel is pretty much a passive component, just like a queue - "bad events" are events that sinks cannot handle for some reason, and the logic for handling them should live in the sink itself.
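
As an illustration, a rough approximation of a "dead letter" path can be configured today with a failover sink processor (all agent, channel, sink, and path names below are made up for the sketch):

```properties
# Sketch: route to a lower-priority local sink when the primary sink fails.
# Names (agent1, hdfsSink, deadLetterSink, ch1) are hypothetical.
agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = hdfsSink deadLetterSink
agent1.sinkgroups.g1.processor.type = failover
agent1.sinkgroups.g1.processor.priority.hdfsSink = 10
agent1.sinkgroups.g1.processor.priority.deadLetterSink = 1
agent1.sinkgroups.g1.processor.maxpenalty = 10000

# The fallback sink writes to local disk for later inspection.
agent1.sinks.deadLetterSink.type = file_roll
agent1.sinks.deadLetterSink.sink.directory = /var/log/flume/dead-letter
agent1.sinks.deadLetterSink.channel = ch1
```

Note the caveat: failover is triggered by sink failure, not per bad event, so this is only an approximation of per-event dead-lettering, not a substitute for handling bad events inside the sink.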

2. Currently that is not an option, but if you need it, chances are others do too. Explain your use case in a JIRA. Remember, Flume is not a file-streaming system; it is an event-streaming one, so each file is still converted into events by Flume.

3. If you think the current deserializers don't fit your use-case, you can easily write your
own and drop it in. 
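
For reference, a custom deserializer is wired in through the source configuration; the class name below is hypothetical, and it must implement Flume's EventDeserializer.Builder interface:

```properties
# Sketch: spooling directory source with a custom deserializer.
# com.example.MyDeserializerBuilder is a made-up class name; point this
# at your class implementing org.apache.flume.serialization.EventDeserializer.Builder.
agent1.sources.spool1.type = spooldir
agent1.sources.spool1.spoolDir = /var/spool/flume
agent1.sources.spool1.deserializer = com.example.MyDeserializerBuilder
agent1.sources.spool1.channels = ch1
```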


On Wednesday, August 14, 2013 at 1:58 PM, Robert Heise wrote:

> Hello,
> As I continue to ramp up on Apache Flume (v1.3.0), I have observed a few challenges and am hoping somebody with more experience can shed some light.
> 1. Establishing a data pipeline is trivial, but I have noticed that any exception caught in the channel->sink operation triggers what appears to be a repeating cycle of exceptions. As an example, any event that causes an exception (Java stack trace) puts the agent into a tailspin. There are no tools for managing the pipeline: identifying culprit events/files, stopping, purging the channel, introspecting the pipeline, etc. The best course of action is to purge everything under the file channel and restart the agent. I've read several posts positing that regex interceptors could be a potential fix, but it is almost impossible to predict, in a production environment, what exceptions are going to occur. In my opinion, there has to be a declarative way to move bad events out of the channel to a "dead-letter queue" or equivalent.
> 2. I was hoping that the Spooling Directory Source would help us capture file metadata, but nothing ever appears in the default .flumespool trackerDir location.
> 3. Maybe my use case is not the right fit for Flume, but my largest design constraint is that we deal with files; everything we do is based on files. I was hoping that the spooldir and batch control options would provide an intuitive way to process files arriving in a spool directory and ultimately land that same data in HDFS. However, a file with 470,000 lines is creating over 52MM events, and because the tooling is weak, I have no visibility into why that many events are being created or where the agent stands with respect to completion. The data flow architecture is perfect, but maybe Flume is best suited for logs, tailing files, etc., not necessarily processing files?
> Thanks
