flume-user mailing list archives

From 김동경 <style9...@gmail.com>
Subject Re: Flume perf measurements
Date Sat, 11 Apr 2015 06:18:18 GMT
Thank you for sharing, Roshan.
I have a few questions.

1. What kind of hardware (HDDs), and how many, did you use for the file
channel benchmark?
Actually, I also benchmarked the file channel, but I only got 2K~3K TPS.
Did you use a separate HDD for each data dir?

Could you share which parts of the configuration were most influential for
high performance?


2. Regarding the 100K exec source batch size: if the agent goes down before
all the events in a batch are committed to the channel, aren't those
messages lost?
Do you have any measures to handle that message loss?


Thanks in advance
Regards
Dongkyoung.



2015-04-11 3:44 GMT+09:00 Roshan Naik <roshan@hortonworks.com>:

>  Will have this info on the wiki soon, but thought of sending it out to
> the users list right away, since there seem to be some threads on
> performance here.
>
>
>
>  Sample Flume v1.4 Measurements for reference:
>
> Here are some sample measurements taken with a single agent and 500-byte
> events.
>
> *Cluster Config:* 20-node Hadoop cluster (1 name node and 19 data nodes).
>
> *Machine Config:* 24 cores – Xeon E5-2640 v2 @ 2.00GHz, 164 GB RAM.
>
>
>  1. File channel with HDFS Sink (Sequence File):
>
> *Source:* 4 x Exec Source, 100k batchSize
>
> *HDFS Sink batch size:* 500,000
>
> *Channel:* File
>
> *Number of data dirs:* 8
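>
> A minimal sketch of what an agent definition along these lines could look
> like (component names, the tail command, paths, capacities and roll
> settings are illustrative placeholders, not the exact benchmark config;
> only the batch sizes and the 8 data dirs come from the setup above). Note
> that a channel's transactionCapacity must be at least as large as the
> largest batch committed to it:
>
>   agent1.sources  = r1 r2 r3 r4
>   agent1.channels = c1
>   agent1.sinks    = k1
>
>   # 4 exec sources, each committing 100k-event batches to the channel
>   agent1.sources.r1.type      = exec
>   agent1.sources.r1.command   = tail -F /var/log/gen1.log
>   agent1.sources.r1.batchSize = 100000
>   agent1.sources.r1.channels  = c1
>   # ... r2, r3 and r4 configured the same way, all pointing at c1
>
>   # file channel spread over 8 data dirs (ideally one per physical disk)
>   agent1.channels.c1.type                = file
>   agent1.channels.c1.checkpointDir       = /data1/flume/checkpoint
>   agent1.channels.c1.dataDirs            = /data1/flume/data,/data2/flume/data,/data3/flume/data,/data4/flume/data,/data5/flume/data,/data6/flume/data,/data7/flume/data,/data8/flume/data
>   agent1.channels.c1.capacity            = 10000000
>   agent1.channels.c1.transactionCapacity = 500000
>
>   # HDFS sink writing SequenceFiles in 500k-event batches; more sinks
>   # (k2, k3, ...) are added the same way for the multi-sink rows below
>   agent1.sinks.k1.type              = hdfs
>   agent1.sinks.k1.channel           = c1
>   agent1.sinks.k1.hdfs.path         = hdfs://namenode:8020/flume/perf
>   agent1.sinks.k1.hdfs.fileType     = SequenceFile
>   agent1.sinks.k1.hdfs.batchSize    = 500000
>   agent1.sinks.k1.hdfs.rollInterval = 0
>   agent1.sinks.k1.hdfs.rollCount    = 0
>   agent1.sinks.k1.hdfs.rollSize     = 268435456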
>
> *Events/Sec* (by sink count and number of data dirs; peak starred):
>
>  Sinks | 1 dir  | 2 dirs | 4 dirs | 6 dirs |  8 dirs  | 10 dirs
>  ------+--------+--------+--------+--------+----------+---------
>    1   | 14.3 k |        |        |        |          |
>    2   | 21.9 k |        |        |        |          |
>    4   |        | 35.8 k |        |        |          |
>    8   | 24.8 k | 43.8 k | 72.5 k | 77 k   | *78.6 k* | 76.6 k
>   10   |        |        | 58 k   |        |          |
>   12   |        |        | 49.3 k | 49 k   |          |
>
> Was looking for the sweet spot in perf, so did not take measurements for
> all data points on the grid, only for the ones that made sense. For
> example: when perf dropped after adding more sinks, no further
> measurements were taken for those rows.
>
>
>  2.     HDFS Sink:
>
> *Channel:* Memory
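>
> Relative to the sketch in section 1, roughly only the channel and sink
> settings change. Again a hedged sketch reusing the same placeholder names;
> the capacity value is an assumption, and transactionCapacity is sized to
> cover the largest sink batch (1.4 million events):
>
>   # memory channel in place of the file channel
>   agent1.channels.c1.type                = memory
>   agent1.channels.c1.capacity            = 10000000
>   agent1.channels.c1.transactionCapacity = 1400000
>
>   # Snappy runs: compressed stream output instead of a plain SequenceFile
>   agent1.sinks.k1.hdfs.fileType  = CompressedStream
>   agent1.sinks.k1.hdfs.codeC     = snappy
>   agent1.sinks.k1.hdfs.batchSize = 1200000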
>
>
> *Events/Sec* (by number of HDFS sinks, output format and batch size;
> peaks starred):
>
>  Sinks | Snappy, batch 1.2M | Snappy, batch 1.4M | SequenceFile, batch 1.2M
>  ------+--------------------+--------------------+--------------------------
>    1   |       34.3 k       |        33 k        |          33 k
>    2   |        71 k        |        75 k        |          69 k
>    4   |       141 k        |       145 k        |         141 k
>    8   |       271 k        |       273 k        |         251 k
>   12   |       382 k        |       380 k        |         370 k
>   16   |       478 k        |       *538 k*      |         *486 k*
>
>  Some simple observations:
>
>    - Increasing the number of dataDirs helps file channel perf, even on
>      single-disk systems.
>    - Increasing the number of sinks helps.
>    - Max throughput observed was about 538k events/sec for the HDFS sink,
>      which at 500-byte events works out to approx 270 MB/s
>      (538,000 x 500 bytes ≈ 269 MB/s).
>
>
