flume-user mailing list archives

From Steve Johnson <st...@webninja.com>
Subject Re: Would someone please comment on Tail Source in NG?
Date Wed, 05 Sep 2012 14:26:51 GMT
Patrick, thanks, this could be cool for testing, but I'm planning on using
an lf4j avro logger to send straight to an avro source anyway.  That's the
hope at least.

The perl script was just something I used to bench-test the framework
itself.  But ideally, we want to avoid logging to files at all and use
Flume.  However, I will be bench-testing the avro stuff and seeing how it
performs for us; if it's not what I'm looking for, I may be interested in
other options.
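
A minimal sketch of the fall-back-to-file idea from the earlier message quoted below: the application tries to hand each record to the agent and, if that fails, appends it to a local spool file for a later batch replay. The names here (`FallbackLogger`, `send`, `spool_path`, `replay`) are hypothetical and not part of Flume or any logging library; `send` stands in for whatever client call actually delivers a record (avro, NetCat, etc.) and is assumed to raise `OSError` on failure. Records are assumed to be single lines.

```python
import os


class FallbackLogger:
    """Send records to an agent; spool them to a local file when sending fails."""

    def __init__(self, send, spool_path):
        # `send` is a hypothetical callable that delivers one record and
        # raises OSError (e.g. ConnectionRefusedError) when the agent is down.
        self.send = send
        self.spool_path = spool_path

    def log(self, record):
        """Return True if the record was sent, False if it was spooled."""
        try:
            self.send(record)
            return True
        except OSError:
            with open(self.spool_path, "a", encoding="utf-8") as spool:
                spool.write(record + "\n")
            return False

    def replay(self):
        """Resend spooled records in order; return how many were sent."""
        if not os.path.exists(self.spool_path):
            return 0
        with open(self.spool_path, encoding="utf-8") as spool:
            pending = [line.rstrip("\n") for line in spool]
        sent = 0
        for record in pending:
            try:
                self.send(record)
            except OSError:
                break  # agent is down again; keep the rest spooled
            sent += 1
        # Keep only the records that were not sent.
        with open(self.spool_path, "w", encoding="utf-8") as spool:
            for record in pending[sent:]:
                spool.write(record + "\n")
        return sent
```

Note that this sketch can still duplicate records: if the process dies after a record is sent during `replay` but before the spool file is rewritten, that record is sent again on the next replay.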

On Sun, Sep 2, 2012 at 4:45 PM, Patrick Wendell <pwendell@gmail.com> wrote:

> Hey Chris - what Steve said is right on:
>
> "Unless you can always guarantee that you will always be able to
> continue where you left off and never re-send data then it's probably
> best to go right to the logging source and have that piece send
> directly to flume (ie, avro, lf4j plugins etc.)."
>
> If you are using an asynchronous source, like tailing, there is always
> a possibility of data loss. What if the disk that the log is stored on
> fails before flume gets to it? This failure window is inherent in
> trying to collect logs like this - and that is what the warning is
> speaking to.
>
> Steve - I am working on a tool to read through rolled log files on
> disk, send them to a Flume agent, and then rename or delete the
> files... would be interested to hear whether you think this could
> displace your current perl setup in terms of functionality.
>
> - Patrick
>
> On Thu, Aug 30, 2012 at 8:06 AM, Steve Johnson <steve@webninja.com> wrote:
> > Chris, I'm testing something similar from the sounds of it.  We were
> > originally going to go with the idea of using some sort of log tailer to
> > pass events (log recs) into the flume agent.  Right now, I'm testing using a
> > simple perl script that reads a rotated log file and sends its records over
> > the network to a flume agent using the NetCat source.  This is not ideal, but
> > is good enough for some initial Flume testing; right now, I'm just trying
> > to stress test the system.
> >
> > When you think about it, the nature of tailing logs is that you really can't
> > guarantee delivery anyway.  For instance, what happens if you need to take
> > your server down, or the tailer fails and you need to restart it?  Where
> > were you at in tailing the log?  In my case, it is as bad or worse for us to
> > duplicate a logrec as it is to miss one.  So tailing itself is a tricky
> > thing.  Unless you can always guarantee that you will always be able to
> > continue where you left off and never re-send data then it's probably best
> > to go right to the logging source and have that piece send directly to flume
> > (ie, avro, lf4j plugins etc.).  However, the downside there is that if the
> > flume agent goes down, your app generating the logs should as well to ensure
> > you don't process requests that you can't keep a record of, or at least
> > write it smart enough to fall back to a file when that happens so that you
> > can recover them in a batch process later.
> >
> > However, if you're using this for something like syslogging, error logs, or
> > monitoring, it's probably not that critical if you duplicated or missed some
> > logrecs for a short time after a recovery.  I guess it really depends on the
> > application.  I'll be interested to hear your solution though for this, as
> > I'm still in the process myself.
> >
> > Thanks
> >
> >
> > On Thu, Aug 30, 2012 at 9:45 AM, Chris Neal <cwneal@gmail.com> wrote:
> >>
> >> Hi Patrick,
> >>
> >> My issue with ExecSource is the giant warning in the user guide:
> >>
> >> "
> >>
> >> Warning
> >>
> >>
> >>
> >> The problem with ExecSource and other asynchronous sources is that the
> >> source can not guarantee that if there is a failure to put the event into
> >> the Channel the client knows about it. In such cases, the data will be lost.
> >> As a for instance, one of the most commonly requested features is the tail
> >> -F [file]-like use case where an application writes to a log file on disk
> >> and Flume tails the file, sending each line as an event. While this is
> >> possible, there’s an obvious problem; what happens if the channel fills up
> >> and Flume can’t send an event? Flume has no way of indicating to the
> >> application writing the log file that it needs to retain the log or that the
> >> event hasn’t been sent, for some reason. If this doesn’t make sense, you
> >> need only know this: Your application can never guarantee data has been
> >> received when using a unidirectional asynchronous interface such as
> >> ExecSource! As an extension of this warning - and to be completely clear -
> >> there is absolutely zero guarantee of event delivery when using this source.
> >> You have been warned."
> >>
> >>
> >>
> >> "zero guarantee of event delivery" is a bit scary for a production system.
> >> :)  This is what I'm currently using, and have to figure out a way to
> >> determine if events were dropped due to exceptions such as the one noted
> >> above (I'd love to hear some thoughts on this, btw!).  AFAIK, this was the
> >> best way to accomplish the tail -F use case.  Maybe I'm overly concerned
> >> about this reliability aspect, but after reading that paragraph, it sure
> >> left me with the impression that ExecSource was not the source of choice for
> >> guaranteed delivery.
> >>
> >> One of our requirements was to not have to make modifications to every
> >> application that we wanted to get into HDFS, which is why Flume was an
> >> obvious choice!  Putting Flume inside the application was not an acceptable
> >> solution given that requirement, unfortunately.
> >>
> >> I am not familiar with the asynchronous log spooler.  Please point me to
> >> some links!  I thought I had investigated all possibilities. :)
> >>
> >> I didn't realize the inode limitation in Java.  That does make things
> >> "difficult" to say the least.  For our immediate needs, I'll stick with
> >> ExecSource, but look at doing a client implementation in C or Python and
> >> pass events into an AVRO source within the agent.
> >>
> >> Thanks so much for everyone's time and comments!
> >> Chris
> >>
> >> On Thu, Aug 30, 2012 at 12:07 AM, Patrick Wendell <pwendell@gmail.com>
> >> wrote:
> >>>
> >>> Hey Chris,
> >>>
> >>> I'm not clear what functionality you would want a TailSource to
> >>> offer that's not already offered by (a) using ExecSource, (b)
> >>> putting flume inside your application, or (c) using the asynchronous log
> >>> spooler that I am working on.
> >>>
> >>> It's impossible to correctly "watch" a file from within the JVM across
> >>> application restarts. For instance, if the file is renamed, swapped,
> >>> or modified while the JVM is down (as is common with rolling logs),
> >>> there is no way to know whether the old and new file are the same.
> >>>
> >>> Within the bounds of what *is* possible, I'd say we have the use cases
> >>> pretty much covered, but I'm open to debate if I've missed something.
> >>>
> >>> - Patrick
> >>>
> >>> On Wed, Aug 29, 2012 at 6:51 PM, Juhani Connolly
> >>> <juhani_connolly@cyberagent.co.jp> wrote:
> >>> > Hi Chris,
> >>> >
> >>> > A few months back I actually ported the original Flume's tail source, but it
> >>> > was decided (and I agree with the reasoning) not to include it for a number
> >>> > of reasons, which can be seen on the original ticket at
> >>> > https://issues.apache.org/jira/browse/FLUME-931 . One of the big ones is the
> >>> > fact that java cannot access inode information.
> >>> >
> >>> > What we do is have a python program that tracks the files in a directory and
> >>> > then sends the data using the scribe format to the ScribeSource (we were
> >>> > using scribe until switching to flume, so are just using our ingest system
> >>> > from then). This allows for the freedom to customize the ingest to our own
> >>> > expectations, and we write checkpoints of how far we have tailed. You could
> >>> > write this in whatever language you're comfortable with and pass the data
> >>> > via avro or thrift.
> >>> >
> >>> >
> >>> > On 08/30/2012 01:18 AM, Chris Neal wrote:
> >>> >
> >>> > Hey guys,
> >>> >
> >>> > I'm sure this is not a new question, but I haven't found an answer in my
> >>> > searches.  I'm curious why there is as of yet no Tail Source with NG?  It
> >>> > seems one of the most common use cases for Flume is to tail a log file and
> >>> > dump it "somewhere".  Given that, it sure would seem that a Tail Source
> >>> > would be one of the first sources that gets written with a new version.
> >>> >
> >>> > I know about all the other ways to implement something *like* a Tail Source:
> >>> > Exec Source, AVRO, Log4Jappender...  and unfortunately they all have
> >>> > limitations with regards to either functionality or
> >>> > reliability/recoverability.
> >>> >
> >>> > What am I missing here?
> >>> >
> >>> > Is there any work being done on a Tail Source for NG?
> >>> >
> >>> > I promise I'm not complaining, just trying to understand the logic. :)
> >>> >
> >>> > Much appreciated.
> >>> > Chris
> >>> >
> >>> >
> >>
> >>
> >
> >
> >
> > --
> > Steve Johnson
> > steve@webninja.com
>
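
A minimal sketch of the checkpointed tailing described in the quoted thread above (an external program, outside the JVM, that records how far it has read and uses the inode to tell whether the file was swapped). This is an illustration under stated assumptions, not the Python program mentioned in the thread: the function names (`load_checkpoint`, `save_checkpoint`, `read_new_lines`) and the JSON checkpoint format are hypothetical. When the inode changes it simply starts the new file from offset 0, so any unread tail of the rotated file is skipped.

```python
import json
import os


def load_checkpoint(checkpoint_path):
    """Return the saved {'inode', 'offset'} state, or a fresh one."""
    try:
        with open(checkpoint_path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return {"inode": None, "offset": 0}


def save_checkpoint(checkpoint_path, state):
    """Write the checkpoint atomically via a temp file and rename."""
    tmp_path = checkpoint_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp_path, checkpoint_path)


def read_new_lines(log_path, checkpoint_path):
    """Return (complete new lines, state to save once they are delivered)."""
    state = load_checkpoint(checkpoint_path)
    with open(log_path, "rb") as log:
        stat = os.fstat(log.fileno())
        # A different inode, or a file shorter than our offset, means the
        # log was rotated or truncated: start again from the beginning.
        same_file = (stat.st_ino == state["inode"]
                     and stat.st_size >= state["offset"])
        offset = state["offset"] if same_file else 0
        log.seek(offset)
        data = log.read()
    # Only consume up to the last newline; a partial line waits for next time.
    end = data.rfind(b"\n") + 1
    lines = [raw.decode("utf-8", errors="replace")
             for raw in data[:end].split(b"\n")[:-1]]
    return lines, {"inode": stat.st_ino, "offset": offset + end}
```

The caller would send the returned lines to the agent and call `save_checkpoint` only after the send succeeds. A crash between the two re-sends those lines on restart, so this gives at-least-once rather than exactly-once behaviour, which is the duplicate-versus-loss trade-off discussed in the thread.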



-- 
Steve Johnson
steve@webninja.com
