flume-user mailing list archives

From Gary Malouf <malouf.g...@gmail.com>
Subject Re: Flume log rolling when you need to do rollups for multiple time zones
Date Tue, 29 Jul 2014 16:01:44 GMT
Hi Hari,

Below is the config for one of our source-channel-sink combos.  In the
Hadoop/Spark world, how do you then handle events that arrive late to
their bucket?  That is, events for July 15 UTC end up in the July 16
bucket.  The ugly workaround I have used to date is, for any query over a
particular day, to grab the next UTC day in addition to the current one
(per my path config in the previous email).
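To make that workaround concrete, here is roughly what the day-plus-one
path expansion looks like (a sketch, not our actual job; the base path
matches my sink config below, but the helper name is made up):

```python
from datetime import date, timedelta

def partition_paths(day, base="hdfs://nn-01:8020/impressions"):
    """Return the HDFS partition paths for a UTC day plus the following
    day, so events that rolled late into the next bucket are still read.
    After loading both partitions, rows must still be filtered back down
    to the target day using the event timestamp."""
    return [
        f"{base}/yr={d:%Y}/mo={d:%m}/d={d:%d}/"
        for d in (day, day + timedelta(days=1))
    ]

# partition_paths(date(2014, 7, 15))
# → ['hdfs://nn-01:8020/impressions/yr=2014/mo=07/d=15/',
#    'hdfs://nn-01:8020/impressions/yr=2014/mo=07/d=16/']
```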


agent-1.channels.imp-ch1.type = file
agent-1.channels.imp-ch1.checkpointDir = /opt/flume/imp-ch1/checkpoint
agent-1.channels.imp-ch1.dataDirs = /opt/flume/imp-ch1/data
agent-1.channels.imp-ch1.capacity = 100000000

agent-1.sources.avro-imp-source1.channels = imp-ch1
agent-1.sources.avro-imp-source1.type = avro
agent-1.sources.avro-imp-source1.bind = 0.0.0.0
agent-1.sources.avro-imp-source1.port = 41414

agent-1.sources.avro-imp-source1.interceptors = host1 timestamp1
agent-1.sources.avro-imp-source1.interceptors.host1.type = host
agent-1.sources.avro-imp-source1.interceptors.host1.useIP = false
agent-1.sources.avro-imp-source1.interceptors.timestamp1.type = timestamp


agent-1.sinks.hdfs-imp-sink1.channel = imp-ch1
agent-1.sinks.hdfs-imp-sink1.type = hdfs
agent-1.sinks.hdfs-imp-sink1.hdfs.path = hdfs://nn-01:8020/impressions/yr=%Y/mo=%m/d=%d/
agent-1.sinks.hdfs-imp-sink1.hdfs.filePrefix = %{host}
agent-1.sinks.hdfs-imp-sink1.hdfs.batchSize = 25
agent-1.sinks.hdfs-imp-sink1.hdfs.rollInterval = 3600
agent-1.sinks.hdfs-imp-sink1.hdfs.rollCount = 0
agent-1.sinks.hdfs-imp-sink1.hdfs.rollSize = 0
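
For what it's worth, if we were to add the idleTimeout you mentioned, I
assume it would just be one more line on the sink (value in seconds, if I
am reading the docs right):

```
# close a file that has not been written to for 10 minutes
agent-1.sinks.hdfs-imp-sink1.hdfs.idleTimeout = 600
```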




On Tue, Jul 29, 2014 at 11:17 AM, Hari Shreedharan <
hshreedharan@cloudera.com> wrote:

> Can you send your config? There are a couple of params that allow the
> files to be rolled faster - idleTimeout and rollInterval. I am assuming you
> are using rollInterval already. idleTimeout will close a file when it is
> not written to for the configured time. That might help with the rolling.
> Remember though that if events arrive "late" for a bucket due to failures
> or network issues, new files will be opened in that bucket if none are
> currently open.
>
>
> On Tue, Jul 29, 2014 at 7:29 AM, Gary Malouf <malouf.gary@gmail.com>
> wrote:
>
>> We are an ad tech company that buys and sells digital media.  To date, we
>> have been using Apache Flume 1.4.x to ingest all of our bid request,
>> response, impression and attribution data.
>>
>> The logs currently 'roll' hourly for each data type, meaning that at some
>> point during each hour (if Flume is behaving) the tmp file in HDFS is
>> closed/renamed with a new one being opened.  This is done for each of 5
>> running Flume instances.
>>
>> One problem that has been a challenge to date is effectively bounding our
>> data queries to make sure we capture all of the data for a given interval
>> without pulling in the world.  To date, our structure (all in UTC) for each
>> data type is:
>>
>> /datatype/yr=2014/mo=06/d=15/{files}
>>
>> The challenge for us is that Flume is not perfect.
>>
>> 1) It can and often will write data that came in on the new UTC day into
>> the previous day's bucket if that log file has not rolled yet.
>>
>> 2) Since it does not roll exactly at the top of each hour, we are
>> having trouble determining the best way to tightly bound a query for
>> data within a narrow [3-6 hour] window.
>>
>> 3) When we are doing data rollups in time zones other than UTC, we end
>> up reading in all of the data for both UTC days containing that data, to
>> be on the safe side.  It would be nice to bound this as described in (2).
>>
>>
>> One of the major problems affecting the first two cases is that Flume
>> sometimes gets 'stuck' - that is, the data will hang out in the file
>> channel for longer than we anticipate.
>>
>> Anyway, I was just wondering how others have approached these problems to
>> date.  If not for the edge cases when data can get stuck in Flume, I think
>> this would be straightforward.
>>
>>
>
