From: Jeff Lord
To: user@flume.apache.org
Date: Tue, 8 Jan 2013 18:40:31 -0800
Subject: Re: Of BatchSize / Channel Capacity / Transaction Capacity

Hi Bhaskar,

1) Batch Size

  1.a) When configured by client code using the flume-core-sdk, to send events to flume avro source.
The Flume client SDK has an appendBatch method. It takes a list of events and sends them to the source as a batch. The batch size here is the number of events passed to the source at one time.

  1.b) When set as a parameter on HDFS sink (or other sinks which support BatchSize parameter)
This is the number of events written to the file before it is flushed to HDFS.

2)
  2.a) Channel Capacity
This is the maximum number of events the channel can hold.

  2.b) Channel Transaction Capacity.
This is the maximum number of events stored in the channel per transaction.
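
To show where each of these settings lives, here is a small sketch. The agent and component names (a1, r1, c1, k1), the port, the HDFS path and all of the numbers are made up for illustration. The property keys capacity, transactionCapacity and hdfs.batchSize are the ones the memory channel and the HDFS sink read. The client batch size from 1.a does not appear in the agent config at all; it is simply however many events your code hands to appendBatch. The Python wrapper is only there to parse the text and print the values.

```python
# Sample Flume agent properties. The names a1/r1/c1/k1, the port, the path
# and the numbers are illustrative assumptions, not recommendations.
SAMPLE_CONF = """
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414
a1.sources.r1.channels = c1

# 2.a) most events the channel can hold
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
# 2.b) most events per transaction
a1.channels.c1.transactionCapacity = 100

# 1.b) events written before the sink flushes to HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.hdfs.batchSize = 100
"""


def parse_properties(text):
    """Parse 'key = value' lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


props = parse_properties(SAMPLE_CONF)
capacity = int(props["a1.channels.c1.capacity"])
transaction_capacity = int(props["a1.channels.c1.transactionCapacity"])
hdfs_batch_size = int(props["a1.sinks.k1.hdfs.batchSize"])
print(capacity, transaction_capacity, hdfs_batch_size)
```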
How will setting these parameters to different values affect throughput and latency in event flow?

In general you will see better throughput with the memory channel than with the file channel, at the cost of durability.

The channel capacity needs to be large enough to hold as many events as upstream agents will add to it. Ideally, the sink drains events from the channel faster than the source adds them.
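
As a rough way to reason about sizing, the arithmetic can be sketched like this. The rates and the helper name seconds_until_full are assumptions for illustration only; you would plug in rates measured on your own agents.

```python
def seconds_until_full(capacity, in_rate, out_rate, backlog=0):
    """Seconds until a channel with `capacity` slots fills up.

    in_rate and out_rate are events per second added by the source and
    drained by the sink. Returns None if the sink keeps up.
    """
    growth = in_rate - out_rate
    if growth <= 0:
        return None  # the sink drains at least as fast as the source fills
    return (capacity - backlog) / growth


# A 10000-event channel during a burst: 3000 events/s in, 1000 events/s out.
burst = seconds_until_full(capacity=10000, in_rate=3000, out_rate=1000)
# The same channel in steady state: 500 events/s in, 1000 events/s out.
steady = seconds_until_full(capacity=10000, in_rate=500, out_rate=1000)
print(burst)   # 5.0
print(steady)  # None
```

In this made-up case the channel absorbs a burst for about five seconds before puts start failing, so the capacity has to cover the longest burst you expect.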

The channel transaction capacity needs to be smaller than the channel capacity.
For example, if your channel capacity is set to 10000, then the channel transaction capacity should be set to something like 100.

Specifically if we have clients with varying frequency of event generation, i.e. some clients generating thousands of events/sec, while others at a much slower rate, what effect will different values of these params have on these clients?

Transaction capacity is what throttles or limits how many events the source can put into the channel. The right value is going to vary depending on how many tiers of agents/collectors you have set up.
In general, though, this should probably be equal to whatever you have the batch size set to in your client.
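
To illustrate why the client batch size and the transaction capacity should line up, here is a toy model. ToyChannel and ChannelError are hypothetical names and this is not how Flume implements its channels; it only mimics the two limits described above, assuming a batch is put in a single transaction.

```python
from collections import deque


class ChannelError(Exception):
    """Raised when a put does not fit; a stand-in for a channel exception."""


class ToyChannel:
    """Toy model of a channel with the two limits; not Flume's code."""

    def __init__(self, capacity, transaction_capacity):
        if transaction_capacity > capacity:
            raise ValueError("transaction capacity must not exceed capacity")
        self.capacity = capacity
        self.transaction_capacity = transaction_capacity
        self.events = deque()

    def put_batch(self, batch):
        """Put a whole batch in one transaction, or reject all of it."""
        if len(batch) > self.transaction_capacity:
            raise ChannelError("batch larger than transaction capacity")
        if len(self.events) + len(batch) > self.capacity:
            raise ChannelError("channel full")
        self.events.extend(batch)


channel = ToyChannel(capacity=10000, transaction_capacity=100)
channel.put_batch(list(range(100)))      # batch size 100 fits in one transaction

try:
    channel.put_batch(list(range(500)))  # batch size 500 is rejected as a whole
    rejected = False
except ChannelError:
    rejected = True

print(len(channel.events), rejected)  # 100 True
```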

With regard to the HDFS batch size, the larger your batch size, the better performance will be. However, keep in mind that if a transaction fails, the entire transaction will be replayed, which can result in duplicate events downstream.
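
The trade-off can be seen in a toy simulation. The deliver function below is a made-up model, not the HDFS sink: it assumes events written before a failure stay written, so replaying the batch writes them a second time.

```python
def deliver(events, batch_size, fail_once_at=None):
    """Toy sink loop: take a batch, write it, commit; replay it on failure.

    fail_once_at is the index of an event whose write fails the first
    time it is attempted. Events written before the failure stay written,
    so the replay writes them again.
    """
    written = []            # what ends up downstream
    failed_already = False
    start = 0
    while start < len(events):
        batch = events[start:start + batch_size]
        try:
            for offset, event in enumerate(batch):
                if start + offset == fail_once_at and not failed_already:
                    failed_already = True
                    raise IOError("write failed mid-batch")
                written.append(event)
            start += len(batch)   # commit: move past this batch
        except IOError:
            pass                  # rollback: the same batch is taken again
    return written


events = list(range(10))
small = deliver(events, batch_size=2, fail_once_at=5)
large = deliver(events, batch_size=10, fail_once_at=5)
print(len(small) - len(events))  # 1 duplicate
print(len(large) - len(events))  # 5 duplicates
```

With the same single failure, the larger batch replays more events, so anything downstream that cannot tolerate duplicates needs to account for that.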

-Jeff




On Tue, Jan 8, 2013 at 10:46 AM, Bhaskar V. Karambelkar <bhaskarvk@gmail.com> wrote:
> Can some one explain the importance of the following
> 1) Batch Size
>   1.a) When configured by client code using the flume-core-sdk , to send events to flume avro source.
>   1.b) When set as a parameter on HDFS sink (or other sinks which support BatchSize parameter)
> 2)
>   2.a) Channel Capacity
>   2.b) Channel Transaction Capacity.
>
> Under which conditions should these params be set to high values, and under which conditions should they be set to low values.
>
> How will setting these parameters to different values, affect throughput, latency in event flow.
> Specifically if we have clients with varying frequency of event generation, i.e. some clients generating thousands of events/sec, while others at a much slower rate, what effect will different values of these params have on these clients ?
>
> thanks
> Bhaskar
