hadoop-common-issues mailing list archives

From "Jeff Lord (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-9198) Update Flume Wiki and User Guide to provide clearer explanation of BatchSize, ChannelCapacity and ChannelTransactionCapacity properties.
Date Fri, 11 Jan 2013 16:56:13 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-9198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Lord updated HADOOP-9198:
------------------------------

    Component/s: documentation
    
> Update Flume Wiki and User Guide to provide clearer explanation of BatchSize, ChannelCapacity
and ChannelTransactionCapacity properties.
> ----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-9198
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9198
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: documentation
>            Reporter: Jeff Lord
>
> It would be good if we refined our wiki and user guide to explain the following more
clearly:
> 1) Batch Size
>   1.a) When configured by client code using the flume-core-sdk, to send events to a Flume
Avro source.
> The Flume client SDK has an appendBatch method. It takes a list of events and sends
them to the source as a batch. The batch size is the number of events passed to the
source at one time.
>   1.b) When set as a parameter on the HDFS sink (or other sinks that support a BatchSize parameter).
> This is the number of events written to a file before it is flushed to HDFS.
> 2)
>   2.a) Channel Capacity
> This is the maximum number of events the channel can hold.
>   2.b) Channel Transaction Capacity
> This is the maximum number of events stored in the channel per transaction.
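> As a rough illustration of what batch size means on the client side, the sketch below
> splits a list of events into batches before they would be handed to a call such as
> appendBatch. It is plain Python, not the Flume SDK: the function name split_into_batches
> and the sample events are hypothetical and exist only for this example.

```python
from typing import Iterator, List, Sequence


def split_into_batches(events: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `events`, each holding at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(events), batch_size):
        yield list(events[start:start + batch_size])


# 250 hypothetical events sent with a batch size of 100.
events = [f"event-{i}" for i in range(250)]
batch_sizes = [len(batch) for batch in split_into_batches(events, 100)]
print(batch_sizes)  # [100, 100, 50]
```

> Each yielded list stands in for one batch passed to the source at one time; the last
> batch is simply smaller when the event count is not a multiple of the batch size.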
> How will setting these parameters to different values affect throughput and latency in
the event flow?
> In general, you will see better throughput with the memory channel than with the
file channel, at the cost of durability.
> The channel capacity needs to be large enough to hold as many events as upstream agents
will add to it. In an ideal flow, the sink drains events from the channel faster than
the source adds them.
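> To make the sizing argument concrete, here is a toy model (not Flume code) of a channel
> that receives a fixed number of events per tick and is drained by a fixed number per tick.
> The function ticks_until_full and all of the rates are assumptions chosen for
> illustration; real traffic is bursty and would need measurement.

```python
from typing import Optional


def ticks_until_full(capacity: int, put_per_tick: int, take_per_tick: int,
                     max_ticks: int = 10_000) -> Optional[int]:
    """Return the first tick at which a put would exceed `capacity`, or None if none does."""
    queued = 0
    for tick in range(1, max_ticks + 1):
        queued += put_per_tick           # source adds events
        if queued > capacity:
            return tick                  # channel cannot accept this put
        queued = max(0, queued - take_per_tick)  # sink drains events
    return None


# Source slightly faster than sink: the backlog grows by 100 events per tick.
print(ticks_until_full(capacity=10_000, put_per_tick=1_000, take_per_tick=900))    # 92
# Sink at least as fast as source: the channel never fills in this model.
print(ticks_until_full(capacity=10_000, put_per_tick=1_000, take_per_tick=1_000))  # None
```

> In this model a larger capacity only delays the overflow when the sink is slower than the
> source; it does not prevent it.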
> The channel transaction capacity will need to be smaller than the channel capacity.
> e.g. If your channel capacity is set to 10000, then the channel transaction capacity should
be set to something like 100.
> Specifically, if we have clients with varying frequency of event generation, i.e. some
clients generating thousands of events/sec while
> others generate at a much slower rate, what effect will different values of these params
have on these clients?
> Transaction capacity is what throttles or limits how many events the source
can put into the channel. This is going to vary depending on how many tiers of agents/collectors
you have set up.
> In general, though, this should probably be equal to whatever you have the batch size set
to in your client.
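> The relationships described above can be written down as a small sanity check. This is a
> sketch only: check_sizing is a hypothetical helper, not part of Flume, and the second
> rule (batch size must not exceed transaction capacity) is an assumption drawn from the
> advice that the two should be roughly equal.

```python
from typing import List


def check_sizing(capacity: int, transaction_capacity: int, batch_size: int) -> List[str]:
    """Return human-readable problems with the given sizes; an empty list means none found."""
    problems: List[str] = []
    if transaction_capacity > capacity:
        problems.append("transaction capacity exceeds channel capacity")
    # Assumption: a batch larger than one transaction cannot be put in a single transaction.
    if batch_size > transaction_capacity:
        problems.append("batch size exceeds transaction capacity")
    return problems


# The example values from the text: capacity 10000, transaction capacity 100.
print(check_sizing(capacity=10_000, transaction_capacity=100, batch_size=100))  # []
print(check_sizing(capacity=100, transaction_capacity=1_000, batch_size=100))
```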
> With regard to the HDFS batch size, the larger your batch size, the better performance
will be. However, keep in mind that if a transaction fails, the entire transaction will be
replayed, which could result in duplicate events downstream.
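> The duplicate-event risk can be seen in a small simulation. This is not Flume code:
> deliver_with_replay is a hypothetical function that assumes the first attempt writes
> some events downstream before failing, after which the whole batch is sent again.

```python
from collections import Counter
from typing import List


def deliver_with_replay(batch: List[str], fail_after: int) -> List[str]:
    """Return what lands downstream when the first attempt fails after `fail_after` events."""
    downstream: List[str] = []
    # First attempt: some events reach the destination before the failure.
    downstream.extend(batch[:fail_after])
    # The transaction is rolled back, so the entire batch is replayed.
    downstream.extend(batch)
    return downstream


batch = [f"event-{i}" for i in range(5)]
written = deliver_with_replay(batch, fail_after=3)
duplicates = sorted(event for event, count in Counter(written).items() if count > 1)
print(duplicates)  # ['event-0', 'event-1', 'event-2']
```

> In this model, the larger the batch, the more events can be duplicated by a single
> failed transaction, which is the trade-off against the throughput gain.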

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
