flume-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hari Shreedharan <hshreedha...@cloudera.com>
Subject Re: Configuring flume for better throughput
Date Thu, 01 Aug 2013 03:27:54 GMT
A lot of it depends on the disks you are using and how many disks you have given the file channel.
In general, performance improves if you give it more disks, as it round-robins between disks,
so multiple writes and reads can happen without waiting for a full seek. 

Also, the file channel does write every event to disk when they are written to the channel
- and when they are read, they are read back from disk (See the Log Structured File System
paper for details on the basic design).  This allows the channel to hold more events than
can fit in memory and also allows full recovery from failure. I'd recommend using a Null sink
or a custom sink that updates some metrics (and does nothing else) to see if the File Channel
is really your bottle neck. 


Thanks,
Hari


On Wednesday, July 31, 2013 at 7:24 PM, Pankaj Gupta wrote:

> Also, agent1.sinks.hdfs-sink1-1.hdfs.threadsPoolSize = 1, might seem odd but we only
write to one file on HDFS per sink, so 1 seems to be the right value. In any case, I've tried
increasing this value to 10 to no effect.
> 
> 
> On Wed, Jul 31, 2013 at 7:22 PM, Pankaj Gupta <pankaj@brightroll.com (mailto:pankaj@brightroll.com)>
wrote:
> > I'm continuing to debug the performance issues, added more sinks but it all seems
to be boiling down to the performance of the FileChannel. Right now I'm focusing on the performance
of the HDFS Writer machine. On that machine I have 4 disks(apart from a separate disk just
for the OS), so I'm using 4 file channels with checkpoint + data directories on their own
dedicated disk. As mentioned earlier, Avro Sinks write to these FileChannels and HDFS Sinks
drain the channel. I'm getting very poor performance draining the channels, ~2.5MB/s for all
4 channels combined. I replaced the file channel with memory channel just to test and saw
that I could drain the channels at more than 15 MB/s. So HDFS sinks aren't the issue. 
> > 
> > I haven't seen any issue with writing to the FileChannel so far, I'm surprised that
reading is turning out to be slower. Here are the FileChannel stats:
> > "CHANNEL.ch1": {
> >         "ChannelCapacity": "75000000",
> >         "ChannelFillPercentage": "7.5033080000000005",
> >         "ChannelSize": "5627481",
> >         "EventPutAttemptCount": "11465743",
> >         "EventPutSuccessCount": "11465481",
> >         "EventTakeAttemptCount": "5841907",
> >         "EventTakeSuccessCount": "5838000",
> >         "StartTime": "1375320933471",
> >         "StopTime": "0",
> >         "Type": "CHANNEL"
> >     },
> >     
> > 
> > EventTakeAttemptCount is much less than EventPutAttemptCount and the sinks are lagging.
I'm surprised how even the attempts to drain the channel are lesser. That would seem to point
to the HDFS sinks but they do just fine with the Memory Channel, so they are clearly not bound
on either writing to HDFS or on network I/O. I've checked the network capacity separately
as well and we are using less than 10% of the network capacity, thus definitely not bound
there.
> > 
> > In my workflow reliability of FileChannel is essential thus can't switch to Memory
channel. I would really appreciate any suggestions on how to tune the performance of FileChannel.
Here are the settings of one of the FileChannels: 
> > 
> > agent1.channels.ch1.type = FILE
> > agent1.channels.ch1.checkpointDir = /flume1/checkpoint
> > agent1.channels.ch1.dataDirs = /flume1/data
> > agent1.channels.ch1.maxFileSize = 375809638400
> > agent1.channels.ch1.capacity = 75000000
> > 
> > agent1.channels.ch1.transactionCapacity = 24000
> > agent1.channels.ch1. checkpointInterval = 300000
> > 
> > 
> > As can be seen I increased the checkpointInterval but that didn't help either. 
> > 
> > Here are the settings for one of the HDFS Sinks. I have tried varying the number
of these sinks from 8 to 32 to no effect:
> > agent1.sinks.hdfs-sink1-1.channel = ch1
> > agent1.sinks.hdfs-sink1-1.type = hdfs
> > #Use DNS of the HDFS namenode
> > agent1.sinks.hdfs-sink1-1.hdfs.path = hdfs://nameservice1/store/f-1-1/
> > agent1.sinks.hdfs-sink1-1.hdfs.filePrefix = event
> > agent1.sinks.hdfs-sink1-1.hdfs.writeFormat = Text
> > agent1.sinks.hdfs-sink1-1.hdfs.rollInterval = 120
> > agent1.sinks.hdfs-sink1-1.hdfs.idleTimeout= 180
> > agent1.sinks.hdfs-sink1-1.hdfs.rollCount = 0
> > agent1.sinks.hdfs-sink1-1.hdfs.rollSize = 0
> > agent1.sinks.hdfs-sink1-1.hdfs.fileType = DataStream
> > agent1.sinks.hdfs-sink1-1.hdfs.batchSize = 1000
> > agent1.sinks.hdfs-sink1-1.hdfs.txnEventSize = 1000
> > agent1.sinks.hdfs-sink1-1.hdfs.callTimeout = 20000
> > agent1.sinks.hdfs-sink1-1.hdfs.threadsPoolSize = 1
> > 
> > 
> > I've tried increasing the batchSize(along with txnEventSize) of HDFS Sink from 1000
to 240000 without effect.
> > 
> > I've also verified that there is enough RAM on the box for enough page cache and
iostat shows almost no reads going to disk. I really can't figure out why FileChannel would
be so much slower than memory channel if reads are being served from Memory. 
> > 
> > FileChannel is so fundamental to our workflow, I would expect it would be for others
too. What has been the experience of others with FileChannel? I will really appreciate any
suggestions.
> > 
> > Thanks in Advance,
> > Pankaj
> > 
> > 
> > 
> > On Fri, Jul 26, 2013 at 2:12 PM, Pankaj Gupta <pankaj@brightroll.com (mailto:pankaj@brightroll.com)>
wrote:
> > > Here is the flume config of the collector machine. The File channel is drained
by 4 flume sinks that send messages to a separate hdfs-writer machine.
> > > 
> > > 
> > > agent1.channels.ch1.type = FILE 
> > > agent1.channels.ch1.checkpointDir = /flume1/checkpoint
> > > agent1.channels.ch1.dataDirs = /flume1/data
> > > agent1.channels.ch1.maxFileSize = 375809638400
> > > agent1.channels.ch1.capacity = 75000000
> > > agent1.channels.ch1.transactionCapacity = 4000
> > > 
> > > agent1.sources.avroSource1.channels = ch1
> > > agent1.sources.avroSource1.type = avro
> > > agent1.sources.avroSource1.bind = 0.0.0.0
> > > agent1.sources.avroSource1.port = 4545
> > > agent1.sources.avroSource1.threads = 16 
> > > 
> > > agent1.sinks.avroSink1-1.type = avro
> > > agent1.sinks.avroSink1-1.channel = ch1
> > > agent1.sinks.avroSink1-1.hostname = hdfs-writer-machine-a.mydomain.com (http://hdfs-writer-machine-a.mydomain.com)
> > > agent1.sinks.avroSink1-1.port = 4545
> > > agent1.sinks.avroSink1-1.connect-timeout = 300000
> > > agent1.sinks.avroSink1-1.batch-size = 4000
> > > 
> > > agent1.sinks.avroSink1-2.type = avro 
> > > agent1.sinks.avroSink1-2.channel = ch1
> > > agent1.sinks.avroSink1-2.hostname = hdfs-writer-machine-b.mydomain.com (http://hdfs-writer-machine-b.mydomain.com)
> > > agent1.sinks.avroSink1-2.port = 4545
> > > agent1.sinks.avroSink1-2.connect-timeout = 300000
> > > agent1.sinks.avroSink1-2.batch-size = 4000
> > > 
> > > agent1.sinks.avroSink1-3.type = avro
> > > agent1.sinks.avroSink1-3.channel = ch1
> > > agent1.sinks.avroSink1-3.hostname = hdfs-writer-machine-c.mydomain.com (http://hdfs-writer-machine-c.mydomain.com)
> > > agent1.sinks.avroSink1-3.port = 4545
> > > agent1.sinks.avroSink1-3.connect-timeout = 300000
> > > agent1.sinks.avroSink1-3.batch-size = 4000
> > > 
> > > agent1.sinks.avroSink1-4.type = avro
> > > agent1.sinks.avroSink1-4.channel = ch1
> > > agent1.sinks.avroSink1-4.hostname = hdfs-writer-machine-d.mydomain.com (http://hdfs-writer-machine-d.mydomain.com)
> > > agent1.sinks.avroSink1-4.port = 4545
> > > agent1.sinks.avroSink1-4.connect-timeout = 300000
> > > agent1.sinks.avroSink1-4.batch-size = 4000
> > > 
> > > 
> > > #Add the sink groups; load-balance between each group of sinks which round
robin between different hops 
> > > agent1.sinkgroups = group1
> > > agent1.sinkgroups.group1.sinks = avroSink1-1 avroSink1-2 avroSink1-3 avroSink1-4

> > > agent1.sinkgroups.group1.processor.type = load_balance
> > > agent1.sinkgroups.group1.processor.selector = ROUND_ROBIN
> > > agent1.sinkgroups.group1.processor.backoff = true
> > > 
> > > 
> > > 
> > > On Fri, Jul 26, 2013 at 1:38 PM, Pankaj Gupta <pankaj@brightroll.com (mailto:pankaj@brightroll.com)>
wrote:
> > > > Hi Roshan,
> > > > 
> > > > Thanks for the reply. Sorry I worded the first question wrong and confused
sources with sinks. What I meant to ask was: 
> > > > 1. Are the batches from flume Avro Sink sent to the Avro Source on the
next machine in a pipelined fasion or is the next batch only sent once an ack for previous
batch is received?
> > > > 
> > > > Overall it sounds like adding more sinks would provide more concurrency.
I'm going to try that. 
> > > > 
> > > > About the large batch size, in our use case it won't be a big issue as
long as we can set a timeout after which whatever events are accumulated are sent without
requiring the batch to be full. Does such a setting exist? 
> > > > 
> > > > Thanks, 
> > > > Pankaj
> > > > 
> > > > 
> > > > 
> > > > 
> > > > On Fri, Jul 26, 2013 at 10:59 AM, Roshan Naik <roshan@hortonworks.com
(mailto:roshan@hortonworks.com)> wrote:
> > > > > could you provide a sample of the config you are using ? 
> > > > > 
> > > > > Are the batches from flume source sent to the sink in a pipelined
fasion or is the next batch only sent once an ack for previous batch is received? 
> > > > > Source does not send to sink directly. Source dumps a batch of events
into the channel... and the sink picks it form the channel in batches and writes them to destination.
Sink fetches a batch from channel and writes to destination and then fetches the next batch
from channel.. and the cycle continues.
> > > > > 
> > > > > If the batch send is not pipelined then would increasing the number
of sinks draining from the channel help.
> > > > > The idea behind this is to basically achieve pipelining by having
multiple outstanding requests and thus use network better.
> > > > > Increasing the number of sinks will increase concurrency. 
> > > > > 
> > > > > If batch size is very large, e.g. 1 million, would the batch only
be sent once that many events have accumulated or is there a time limit after which whatever
events are accumulated are sent? Is this timelimit configurable? (I looked in the Avro Sink
documentation for such a setting: http://flume.apache.org/FlumeUserGuide.html, but couldn't
find anything, hence asking the question)
> > > > > IMO...Not a good idea to have such a large batch.. esp if you like
to have concurrent sinks. each sink will need to wait for 1mill events to close the transactions
on the channel. 
> > > > > 
> > > > > Does enabling ssl have any significant impact on throughput?
> > > > > Increase in latency is expected but does this also affect throughput.

> > > > > perhaps somebody can comment on this.
> > > > > 
> > > > > 
> > > > > -roshan 
> > > > > 
> > > > > 
> > > > > 
> > > > > On Fri, Jul 26, 2013 at 12:34 AM, Derek Chan <derekcsy@gmail.com
(mailto:derekcsy@gmail.com)> wrote:
> > > > > > We have a similar setup (Flume 1.3) and same problems here.
Increasing the batch size did not help much but setting up multiple AvroSinks did.  
> > > > > > 
> > > > > > 
> > > > > > On 26/7/2013 9:31, Pankaj Gupta wrote:
> > > > > > > Hi, 
> > > > > > > 
> > > > > > > We are trying to figure out how to get better throughput
in our flume pipeline. We have flume instances on a lot of machines writing to a few collector
machines running with a File Channel which in turn write to still fewer hdfs writer machines
running with a File Channel and HDFS Sinks. 
> > > > > > > 
> > > > > > > The problem that we're facing is that we are not getting
good network usage between our flume collector machines and hdfs writer machines. The way
these machines are connected is that the filechannel on collector drains to an Avro Sink which
sends to Avro Source on the writer machine, which in turn writes to a filechannel draining
into an HDFS Sink. So: 
> > > > > > > 
> > > > > > > [FileChannel -> Avro Sink] -> [Avro Source ->
FileChannel -> HDFS Sink] 
> > > > > > > 
> > > > > > > I did a raw network throughput test(using netcat on the
command line) between the collector and the writer and saw a throughput of ~200Megabits/sec.
Whereas the network throughput  (which I observed using iftop) between collector avro sink
and writer avro source never went over 25Megabits/sec, even when the filechannel on the collector
was quite full with millions of events queued up. We obviously want to use the network better
and I am exploring ways of achieving that. The batch size we are using on avro sink on the
collector is 4000. 
> > > > > > > 
> > > > > > > I have a few questions regarding how AvroSource and Sink
work together to help me improve the throughput and will really appreciate a response: 
> > > > > > > Are the batches from flume source sent to the sink in a
pipelined fasion or is the next batch only sent once an ack for previous batch is received?
> > > > > > > If the batch send is not pipelined then would increasing
the number of sinks draining from the channel help.
> > > > > > > The idea behind this is to basically achieve pipelining
by having multiple outstanding requests and thus use network better. 
> > > > > > > If batch size is very large, e.g. 1 million, would the
batch only be sent once that many events have accumulated or is there a time limit after which
whatever events are accumulated are sent? Is this timelimit configurable? (I looked in the
Avro Sink documentation for such a setting: http://flume.apache.org/FlumeUserGuide.html, but
couldn't find anything, hence asking the question)
> > > > > > > Does enabling ssl have any significant impact on throughput?
> > > > > > > Increase in latency is expected but does this also affect
throughput. 
> > > > > > > We are using flume 1.4.0.
> > > > > > > 
> > > > > > > Thanks in Advance, 
> > > > > > > Pankaj
> > > > > > > 
> > > > > > > -- 
> > > > > > > 
> > > > > > > P | (415) 677-9222 ext. 205 (tel:%28415%29%20677-9222%20ext.%20205)
F | (415) 677-0895 | pankaj@brightroll.com (mailto:pankaj@brightroll.com) 
> > > > > > > Pankaj Gupta | Software Engineer
> > > > > > > BrightRoll, Inc. | Smart Video Advertising | www.brightroll.com
(http://www.brightroll.com/)
> > > > > > > 
> > > > > > > United States | Canada | United Kingdom | Germany 
> > > > > > > 
> > > > > > > We're hiring (http://newton.newtonsoftware.com/career/CareerHome.action?clientId=8a42a12b3580e2060135837631485aa7)!

> > > > > 
> > > > 
> > > > 
> > > > 
> > > > -- 
> > > > 
> > > > P | (415) 677-9222 ext. 205 (tel:%28415%29%20677-9222%20ext.%20205) F
| (415) 677-0895 (tel:%28415%29%20677-0895) | pankaj@brightroll.com (mailto:pankaj@brightroll.com)

> > > > Pankaj Gupta | Software Engineer
> > > > BrightRoll, Inc. | Smart Video Advertising | www.brightroll.com (http://www.brightroll.com/)
> > > > 
> > > > United States | Canada | United Kingdom | Germany 
> > > > 
> > > > We're hiring (http://newton.newtonsoftware.com/career/CareerHome.action?clientId=8a42a12b3580e2060135837631485aa7)!

> > > 
> > > 
> > > -- 
> > > 
> > > P | (415) 677-9222 ext. 205 (tel:%28415%29%20677-9222%20ext.%20205) F | (415)
677-0895 (tel:%28415%29%20677-0895) | pankaj@brightroll.com (mailto:pankaj@brightroll.com)

> > > Pankaj Gupta | Software Engineer
> > > BrightRoll, Inc. | Smart Video Advertising | www.brightroll.com (http://www.brightroll.com/)
> > > 
> > > United States | Canada | United Kingdom | Germany 
> > > 
> > > We're hiring (http://newton.newtonsoftware.com/career/CareerHome.action?clientId=8a42a12b3580e2060135837631485aa7)!

> > 
> > 
> > -- 
> > 
> > P | (415) 677-9222 ext. 205 (tel:%28415%29%20677-9222%20ext.%20205) F | (415) 677-0895
(tel:%28415%29%20677-0895) | pankaj@brightroll.com (mailto:pankaj@brightroll.com) 
> > Pankaj Gupta | Software Engineer
> > BrightRoll, Inc. | Smart Video Advertising | www.brightroll.com (http://www.brightroll.com/)
> > 
> > United States | Canada | United Kingdom | Germany 
> > 
> > We're hiring (http://newton.newtonsoftware.com/career/CareerHome.action?clientId=8a42a12b3580e2060135837631485aa7)!

> 
> 
> -- 
> 
> P | (415) 677-9222 ext. 205 F | (415) 677-0895 | pankaj@brightroll.com (mailto:pankaj@brightroll.com)

> Pankaj Gupta | Software Engineer
> BrightRoll, Inc. | Smart Video Advertising | www.brightroll.com (http://www.brightroll.com/)
> 
> United States | Canada | United Kingdom | Germany 
> 
> We're hiring (http://newton.newtonsoftware.com/career/CareerHome.action?clientId=8a42a12b3580e2060135837631485aa7)!



Mime
View raw message