flume-user mailing list archives

From Mike Percy <mpe...@apache.org>
Subject Re: Flume latency issue
Date Fri, 28 Sep 2012 22:01:07 GMT
Hi MK,
Based on a quick look at the code, it looks like the RollingFileSink doesn't
short-circuit on batches the way most of the other sinks do. I would
consider that a minor bug that should be fixed.

So basically it will wait until it can pull batchSize events off the queue
before pushing the data onto the file system.
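
If your build exposes a batch size setting on the file_roll sink, lowering
it should shrink that wait. A hedged sketch against your config names:
sink.batchSize is documented for file_roll in later releases, but I am not
certain 1.2.0 honors it (the value may be hardcoded there).

# hedged sketch: lower the file_roll batch size so writes are not held
# back waiting for a full batch (may not be configurable in 1.2.0)
agent1.sinks.consolidated-sink.sink.batchSize = 1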

Regards,
Mike

On Thu, Sep 27, 2012 at 9:23 PM, Karthikeyan Muthukumarasamy <
mkarthik.here@gmail.com> wrote:

> Hi Mike,
> Thanks for the response!
> I'm using flume-ng 1.2.0.
>
> In the prototype that I'm building, the final consolidated sink writes to a
> file. I intend to extend this with more specific sinks like HBase, JMX etc.
> While writing this mail, it occurs to me that the latency might be caused by
> some buffering happening at the final FILE_ROLL sink!
> I have test scripts that append a timestamped message every second to the
> files hblog, zklog & applog.
> I expect the final consolidated sink's output to be something like this:
> 10:00:00 hblog entry
> 10:00:00 zklog entry
> 10:00:00 applog entry
> 10:00:01 hblog entry
> 10:00:01 zklog entry
> 10:00:01 applog entry
> 10:00:02 hblog entry
> 10:00:02 zklog entry
> 10:00:02 applog entry
> ...and so on, interleaved as above.
>
> But in reality, some batching seems to be occurring in between, and once
> every 10 seconds I see that the consolidated sink writes chunks (from each
> source) to the output file as follows:
> 10:00:00 hblog entry
> 10:00:01 hblog entry
> 10:00:02 hblog entry
> (17 more like this)
> delay...
> 10:00:00 zklog entry
> 10:00:01 zklog entry
> 10:00:02 zklog entry
> (17 more like this)
> delay...
> 10:00:00 applog entry
> 10:00:01 applog entry
> 10:00:02 applog entry
>
> My Flume conf file is as follows:
> # example.conf: A single-node Flume configuration
>
> # Name the components on this agent
> agent1.sources = hbase-src zk-src app-src consolidated-src
> agent1.sinks = hbase-sink zk-sink app-sink consolidated-sink
> agent1.channels = hbase-chn zk-chn app-chn consolidated-chn
>
> # All channels are in-memory channels
> agent1.channels.hbase-chn.type = memory
> agent1.channels.zk-chn.type = memory
> agent1.channels.app-chn.type = memory
> agent1.channels.consolidated-chn.type = memory
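> # note: memory channels default to a capacity of 100 events and a
> # transactionCapacity of 100; raise these if the sources burst faster
> # than the sinks can drain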
>
> # Describe/configure hbase-src
> agent1.sources.hbase-src.type = exec
> agent1.sources.hbase-src.command = tail -F /home/efhjlns/scripts/hblog
> agent1.sources.hbase-src.channels = hbase-chn
>
> # Describe avro hbase-sink
> agent1.sinks.hbase-sink.type = avro
> agent1.sinks.hbase-sink.hostname = localhost
> agent1.sinks.hbase-sink.port = 15001
> agent1.sinks.hbase-sink.channel = hbase-chn
>
> # Describe/configure zk-src
> agent1.sources.zk-src.type = exec
> agent1.sources.zk-src.command = tail -F /home/efhjlns/scripts/zklog
> agent1.sources.zk-src.channels = zk-chn
>
> # Describe avro zk-sink
> agent1.sinks.zk-sink.type = avro
> agent1.sinks.zk-sink.hostname = localhost
> agent1.sinks.zk-sink.port = 15001
> agent1.sinks.zk-sink.channel = zk-chn
>
>
> # Describe/configure app-src
> agent1.sources.app-src.type = exec
> agent1.sources.app-src.command = tail -F /home/efhjlns/scripts/applog
> agent1.sources.app-src.channels = app-chn
>
> # Describe avro app-sink
> agent1.sinks.app-sink.type = avro
> agent1.sinks.app-sink.hostname = localhost
> agent1.sinks.app-sink.port = 15001
> agent1.sinks.app-sink.channel = app-chn
>
> # Describe/configure consolidated-src
> agent1.sources.consolidated-src.type = avro
> agent1.sources.consolidated-src.bind = localhost
> agent1.sources.consolidated-src.port = 15001
> agent1.sources.consolidated-src.channels = consolidated-chn
>
> # Describe consolidated file sink
>
> agent1.sinks.consolidated-sink.type = FILE_ROLL
> agent1.sinks.consolidated-sink.sink.directory = /home/efhjlns/flume-opt-dir
> agent1.sinks.consolidated-sink.sink.rollInterval = 0
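> # rollInterval = 0 disables time-based rolling, so all output stays in one file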
> agent1.sinks.consolidated-sink.channel = consolidated-chn
>
> Thanks & Regards
> MK
>
>
> On Fri, Sep 28, 2012 at 2:38 AM, Mike Percy <mpercy@apache.org> wrote:
>
>> MK,
>> Which version of Flume are you using?
>>
>> This design sounds reasonable to me.
>>
>> Regarding your question about 20-second latency: that should not happen.
>> Can you please confirm that you have observed this? If you have, please
>> attach your flume.conf and we can take a look. Delivery should be nearly
>> instantaneous. The batch sizes are controlled by the client or sink, and
>> in general the sink takes as much as it can from the channel, up to the
>> batch size, but if it takes less we continue immediately regardless. The
>> only time we "back off" is when the channel is empty *and* no events were
>> taken in the current batch; then the sink runner goes into an exponential
>> backoff.
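>>
>> If you want to experiment with batching, the avro sink's batch size is
>> tunable via its batch-size property (default 100). A hedged sketch with a
>> hypothetical sink name:
>>
>> # hedged sketch: a smaller avro sink batch means events are forwarded sooner
>> agent1.sinks.my-avro-sink.batch-size = 10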
>>
>> Regards,
>> Mike
>>
>>
>> On Thu, Sep 27, 2012 at 6:55 AM, Karthikeyan Muthukumarasamy <
>> mkarthik.here@gmail.com> wrote:
>>
>>> Hi,
>>> In my project, various applications and third-party products (3PPs) write
>>> logs to their own separate log files.
>>> There are two limitations with this:
>>> - the structure of the log messages is different in each log file
>>> - the log messages are in different files, so I can't get a single
>>> time-sorted view of all log messages, which is important in some debugging
>>> situations
>>>
>>> As a solution to this problem, I intend to:
>>> - use separate flume sources to tail the various log files in the system
>>> - have an interceptor for each type of flume source that converts all log
>>> messages to a common structure
>>> - have all flume sinks write to an avro port on localhost
>>> - have a separate flume source read from that avro port on localhost
>>> - use fan-out logic to post the data from that source to multiple channels
>>> (a config sketch follows after this list)
>>> - connect each channel to a separate sink like a JMX sink, HBase sink,
>>> etc.
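>>>
>>> For the fan-out part, I have something like this in mind (the channel
>>> names here are placeholders; as I understand it, the default replicating
>>> selector copies each event to every listed channel):
>>>
>>> # hedged sketch: fan out one avro source to multiple channels
>>> agent1.sources.consolidated-src.type = avro
>>> agent1.sources.consolidated-src.bind = localhost
>>> agent1.sources.consolidated-src.port = 15001
>>> agent1.sources.consolidated-src.selector.type = replicating
>>> agent1.sources.consolidated-src.channels = hbase-out-chn jmx-out-chn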
>>>
>>> First of all, is this kind of usage of Flume acceptable, and is there
>>> anything I specifically need to take care of?
>>>
>>> I also notice that the consolidated avro source, which reads data from the
>>> avro port, gets data only in blocks from each source; the latency is around
>>> 20 seconds. Is it possible to reduce this latency, so that the consolidated
>>> avro source receives all events instantaneously, as they are logged to
>>> their log files?
>>>
>>> Thanks in advance!
>>> MK
>>>
>>>
>>
>
