flume-user mailing list archives

From Alexander Alten-Lorenz <wget.n...@gmail.com>
Subject Re: Of BatchSize / Channel Capacity / Transaction Capacity
Date Sat, 12 Jan 2013 09:05:45 GMT
Published in our wiki:
https://cwiki.apache.org/confluence/display/FLUME/BatchSize,+ChannelCapacity+and+ChannelTransactionCapacity+Properties

- Alex

On Jan 11, 2013, at 6:03 PM, Jeff Lord <jlord@cloudera.com> wrote:

> Bhaskar,
> 
> I have created the following jira for this:
> https://issues.apache.org/jira/browse/FLUME-1829
> 
> -Jeff
> 
> 
> On Fri, Jan 11, 2013 at 6:48 AM, Bhaskar V. Karambelkar <bhaskarvk@gmail.com
>> wrote:
> 
>> Thanks Jeff,
>> Clear and detailed explanations. These deserve to be on the wiki, as these
>> parameters have direct implications on the performance of flume nodes.
>> 
>> thanks
>> Bhaskar
>> 
>> 
>> On Tue, Jan 8, 2013 at 9:40 PM, Jeff Lord <jlord@cloudera.com> wrote:
>> 
>>> Hi Bhaskar,
>>> 
>>> 1) Batch Size
>>>  1.a) When configured by client code using the flume-core-sdk , to send
>>> events to flume avro source.
>>> The Flume client SDK has an appendBatch method that takes a list of
>>> events and sends them to the source as a single batch. The batch size is
>>> the number of events passed to the source in one call.
>>> 
>>>  1.b) When set as a parameter on HDFS sink (or other sinks which support
>>> BatchSize parameter)
>>> This is the number of events written to the file before it is flushed to
>>> HDFS.
>>> 
>>> 2)
>>>  2.a) Channel Capacity
>>> This is the maximum number of events the channel can hold.
>>> 
>>>  2.b) Channel Transaction Capacity.
>>> This is the maximum number of events the channel will accept from a
>>> source or hand to a sink in a single transaction.
>>> 
>>> How will setting these parameters to different values, affect throughput,
>>> latency in event flow?
>>> 
>>> In general you will see better throughput with the memory channel than
>>> with the file channel, at the cost of durability.
>>> 
>>> The channel capacity needs to be large enough to hold as many events as
>>> will be added to it by upstream agents. In an ideal flow the sink drains
>>> events from the channel faster than its source adds them.
>>> 
>>> The channel transaction capacity will need to be smaller than the channel
>>> capacity.
>>> e.g. If your channel capacity is set to 10000, then the channel
>>> transaction capacity should be set to something like 100.
>>> 
>>> Specifically if we have clients with varying frequency of event
>>> generation, i.e. some clients generating thousands of events/sec, while
>>> others at a much slower rate, what effect will different values of these
>>> params have on these clients ?
>>> 
>>> Transaction capacity is what throttles or limits how many events the
>>> source can put into the channel in one transaction. This is going to vary
>>> depending on how many tiers of agents/collectors you have set up.
>>> In general, though, this should probably be equal to whatever you have
>>> the batch size set to in your client.
>>> 
>>> With regard to the HDFS batch size, the larger your batch size, the
>>> better performance will be. However, keep in mind that if a transaction
>>> fails, the entire transaction will be replayed, which could result in
>>> duplicate events downstream.
>>> 
>>> -Jeff
>>> 
>>> 
>>> 
>>> 
>>> On Tue, Jan 8, 2013 at 10:46 AM, Bhaskar V. Karambelkar <
>>> bhaskarvk@gmail.com> wrote:
>>> 
>>>> Can someone explain the importance of the following:
>>>> 1) Batch Size
>>>>  1.a) When configured by client code using the flume-core-sdk , to send
>>>> events to flume avro source.
>>>>  1.b) When set as a parameter on HDFS sink (or other sinks which
>>>> support BatchSize parameter)
>>>> 2)
>>>>  2.a) Channel Capacity
>>>>  2.b) Channel Transaction Capacity.
>>>> 
>>>> 
>>>> Under which conditions should these params be set to high values, and
>>>> under which conditions should they be set to low values.
>>>> 
>>>> 
>>>> How will setting these parameters to different values, affect
>>>> throughput, latency in event flow.
>>>> Specifically if we have clients with varying frequency of event
>>>> generation, i.e. some clients generating thousands of events/sec, while
>>>> others at a much slower rate, what effect will different values of these
>>>> params have on these clients ?
>>>> 
>>>> thanks
>>>> Bhaskar
>>>> 
>>> 
>>> 
>> 
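
For reference, the settings discussed in this thread could be written in an
agent configuration roughly as follows. This is only a sketch under stated
assumptions: the agent name (a1), the component names (r1, c1, k1), the bind
address, the port, and the HDFS path are all made up for illustration, and the
values simply follow Jeff's example (capacity 10000, transaction capacity 100).

```properties
# Hypothetical agent "a1": one Avro source, one memory channel, one HDFS sink.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Avro source that SDK clients send batches to.
# The bind address and port are assumptions for this example.
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414
a1.sources.r1.channels = c1

# Memory channel: better throughput than the file channel, but not durable.
a1.channels.c1.type = memory
# Maximum number of events the channel can hold.
a1.channels.c1.capacity = 10000
# Maximum number of events per transaction; kept smaller than capacity.
a1.channels.c1.transactionCapacity = 100

# HDFS sink; the path is a placeholder.
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
# Number of events written to the file before it is flushed to HDFS.
a1.sinks.k1.hdfs.batchSize = 100
```

The sink batch size here is kept no larger than the channel's
transactionCapacity, since the sink takes a whole batch from the channel in one
transaction. The client batch size is set in client code rather than in this
file, and per the advice above would probably be 100 as well.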

--
Alexander Alten-Lorenz
http://mapredit.blogspot.com
German Hadoop LinkedIn Group: http://goo.gl/N8pCF

