Date: Tue, 12 Mar 2013 14:12:26 -0700
Subject: Re: Best way to increase throughput of Exec->Memory->Avro agent.
From: Roshan Naik <roshan@hortonworks.com>
To: user@flume.apache.org

I meant 640,000, not 64,000.
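
To spell out where that number comes from, here is a rough sketch of the arithmetic. The function name is only illustrative (it is not a Flume API), and it assumes the worst case described in my earlier mail below, where every sink has an open transaction holding a full batch:

```python
def events_held_by_sinks(num_sinks: int, sink_batch_size: int) -> int:
    """Worst-case number of events reserved by open sink-side transactions.

    Assumes every sink has an open transaction holding a full batch.
    """
    return num_sinks * sink_batch_size


# Values from this thread: 64 sinks, each with a batch size of 10000.
held = events_held_by_sinks(num_sinks=64, sink_batch_size=10000)
print(held)  # 640000
```

So the memory channel would need room for roughly that many events before counting anything the sources are still putting in.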

On Tue, Mar 12, 2013 at 2:10 PM, Roshan Naik <roshan@hortonworks.com> wrote:
> Beyond a certain number of sinks, adding more won't help. My suspicion is
> that you may have gone way overboard.
>
> If your sink-side batch size is that large and you have 64 sinks in
> the round-robin, it will take a lot of events (64,000) pumped in by
> the source before the first event can start trickling out
> of any sink. Also, memory consumption will be quite high: each sink
> will open a transaction and hold on to 10000 events. This is the cause
> of the memory channel filling up. Until the sink-side transaction is
> committed (i.e., 10k events are pulled), the memory reservation on the
> channel is not relinquished. So your memory channel size will have to
> be really high to support so many sinks, each with such a big batch size.
>
> My gut feel is that your source-side batch size is not much of an
> issue and can be smaller. Increasing the number of sinks will only
> help if the sink is indeed the bottleneck.
>
> On Tue, Mar 12, 2013 at 1:43 PM, Chris Neal <cwneal@gmail.com> wrote:
>> Hi all.
>>
>> I've been working on this for quite some time, and need some advice from the
>> experts.  I have a two tiered Flume architecture:
>>
>> App Tier (all on one server):
>>  124 ExecSources -> MemoryChannel -> AvroSinks
>>
>> HDFS Tier (on two servers):
>>   AvroSource -> FileChannel -> HDFSSinks
>>
>> When I run the agents, the HDFS tier keeps up fine with the App Tier.
>> Queue sizes stay between 0 and 10000 (I have a batch size of 10000).  All is
>> good.
>>
>> On the App Tier, when I view the JMX data through jconsole, I watch the size
>> of the MemoryChannel grow steadily until it reaches the max, then it starts
>> throwing exceptions about not being able to put the batch on the channel as
>> expected.
>>
>> There seem to be two basic ways to increase the throughput of the App Tier:
>> 1.  Increase the MemoryChannel's transactionCapacity and the corresponding
>> AvroSink's batch-size.  Both are set to 10000 for me.
>> 2.  Increase the number of AvroSinks to drain the MemoryChannel.  I'm up to
>> 64 Sinks now which round-robin between the two Flume Agents on the HDFS
>> tier.
>>
>> Both of those values seem quite high to me (batch size and number of sinks).
>>
>> Am I missing something as far as tuning?
>> Which would allow a greater increase in throughput: more Sinks or a larger
>> batch size?
>>
>> I'm stumped here.  I still think I can get this to work. :)
>>
>> Any suggestions are most welcome.
>> Thanks for your time.
>> Chris
>>