flume-user mailing list archives

From Hari Shreedharan <hshreedha...@cloudera.com>
Subject Re: Flume log rolling when you need to do rollups for multiple time zones
Date Tue, 29 Jul 2014 15:17:52 GMT
Can you send your config? There are a couple of params that allow the files
to be rolled faster - idleTimeout and rollInterval. I am assuming you are
using rollInterval already. idleTimeout will close a file when it is not
written to for the configured time. That might help with the rolling.
Remember though that if events arrive "late" for a bucket due to failures
or network issues, new files will be opened in that bucket if none are
currently open.
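
For reference, here is a minimal sketch of how those two parameters might look on an HDFS sink. The agent and sink names (agent1, hdfsSink) and all of the values are assumptions for illustration, not taken from the poster's setup; only the path layout comes from the original message below.

```properties
# Hypothetical agent and sink names; values are illustrative only.
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = /datatype/yr=%Y/mo=%m/d=%d
# Resolve the %Y/%m/%d escapes in UTC rather than the agent's local time zone.
agent1.sinks.hdfsSink.hdfs.timeZone = UTC
# Roll the open file every hour (value in seconds); 0 disables time-based rolling.
agent1.sinks.hdfsSink.hdfs.rollInterval = 3600
# Disable size- and count-based rolling so only the time triggers apply.
agent1.sinks.hdfsSink.hdfs.rollSize = 0
agent1.sinks.hdfsSink.hdfs.rollCount = 0
# Close a file that has received no writes for 5 minutes (value in seconds); 0 disables.
agent1.sinks.hdfsSink.hdfs.idleTimeout = 300
```

With something like this, a bucket that stops receiving events should be closed after the idle period instead of waiting for the next hourly roll.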

On Tue, Jul 29, 2014 at 7:29 AM, Gary Malouf <malouf.gary@gmail.com> wrote:

> We are an ad tech company that buys and sells digital media.  To date, we
> have been using Apache Flume 1.4.x to ingest all of our bid request,
> response, impression and attribution data.
> The logs currently 'roll' hourly for each data type, meaning that at some
> point during each hour (if Flume is behaving) the tmp file in HDFS is
> closed/renamed with a new one being opened.  This is done for each of 5
> running Flume instances.
> One problem that has been a challenge to date is effectively bounding our
> data queries to make sure we capture all of the data for a given interval
> without pulling in the world.  To date, our structure (all in UTC) for each
> data type is:
> /datatype/yr=2014/mo=06/d=15/{files}
> The challenge for us is that Flume is not perfect.
> 1) It can and will often write data that came in on the new UTC day into
> the previous one if that log file has not rolled yet.
> 2) Since it does not roll perfectly at the top of each hour, we are having
> trouble determining the best way to tightly bound a query for data that
> falls within a window of a few [3-6] hours.
> 3) When we are doing data rollups in timezones other than UTC, we end up
> reading in all of the data for both UTC days containing that data to be on
> the safe side.  It would be nice to bound this as described in (2).
> One of the major problems affecting the first two cases is that Flume
> sometimes gets 'stuck' - that is, the data will hang out in the file
> channel for longer than we anticipate.
> Anyway, I was just wondering how others have approached these problems to
> date.  If not for the edge cases when data can get stuck in Flume, I think
> this would be straightforward.
