incubator-flume-user mailing list archives

From Mike Percy <mpe...@cloudera.com>
Subject Re: Flume NG performance metrics
Date Tue, 08 May 2012 17:36:22 GMT
The "flow" terminology is something I could have defined better. I've updated the config file
fragment a bit in the example on the wiki to hopefully clarify what was done: https://cwiki.apache.org/confluence/display/FLUME/Flume+NG+Performance+Measurements#FlumeNGPerformanceMeasurements-SyslogPerformanceTest20120430

Generally, when folks use the term "flow" in relation to Flume, they mean an independent
Source, Channel, and Sink combination that does not interact with any other Sources, Channels,
or Sinks on the same agent. So each flow in this case had its own SyslogTcpSource port, its
own MemoryChannel, and its own HDFSEventSink output path.
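
To make the "one flow = one Source + one Channel + one Sink" idea concrete, here is a rough sketch (not the actual test config, which is on the wiki page) of a small Python helper that prints an agent properties file with N independent flows. The component names (src1, ch1, sink1), the base port 5140, and the HDFS root path are placeholders I made up for illustration; the property keys are the standard Flume NG ones for the syslogtcp source, memory channel, and hdfs sink.

```python
def flume_flow_config(agent, num_flows, base_port=5140,
                      hdfs_root="hdfs://namenode/flume"):
    """Render a Flume NG properties file with num_flows independent flows.

    Names, ports, and paths are hypothetical placeholders, not the
    values used in the performance tests.
    """
    ids = range(1, num_flows + 1)
    # Declare every component that lives on this single agent.
    lines = [
        f"{agent}.sources = " + " ".join(f"src{i}" for i in ids),
        f"{agent}.channels = " + " ".join(f"ch{i}" for i in ids),
        f"{agent}.sinks = " + " ".join(f"sink{i}" for i in ids),
    ]
    for i in ids:
        lines += [
            "",
            f"# flow {i}: src{i} -> ch{i} -> sink{i}, shares nothing with other flows",
            f"{agent}.sources.src{i}.type = syslogtcp",
            f"{agent}.sources.src{i}.host = 0.0.0.0",
            f"{agent}.sources.src{i}.port = {base_port + i - 1}",  # own port
            f"{agent}.sources.src{i}.channels = ch{i}",
            f"{agent}.channels.ch{i}.type = memory",  # own channel
            f"{agent}.sinks.sink{i}.type = hdfs",
            f"{agent}.sinks.sink{i}.hdfs.path = {hdfs_root}/flow{i}",  # own path
            f"{agent}.sinks.sink{i}.channel = ch{i}",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(flume_flow_config("agent", 2))
```

Each source writes only to its own channel and each sink reads only from its own channel, so the flows never interact even though they all run in the same agent JVM.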

Mike

On May 8, 2012, at 9:38 AM, Arvind Prabhakar wrote:

> Hi Ahmed,
> 
> > 3. 6 flume agent servers that collected data in memory and flushed to the 9 hadoop servers
> 
> All tests were carried out with a single Flume Agent (which corresponds to one JVM process) on a single host. However, each measurement was made with a different number of flows passing through this agent.
> 
> Thanks,
> Arvind
> 
> 
> On Tue, May 8, 2012 at 9:08 AM, S Ahmed <sahmed1020@gmail.com> wrote:
> Just want to make sure I understand the setup:
> 
> 1. 9 hadoop servers that were fed the data
> 2. 1 server was used to generate the syslog data that was spread across the 6 flume agent servers
> 3. 6 flume agent servers that collected data in memory and flushed to the 9 hadoop servers
> 
> Is that right?
> 
> 
> On Tue, May 8, 2012 at 1:49 AM, Jarek Jarcec Cecho <jarcec@apache.org> wrote:
> Thanks Mike,
> this is indeed very helpful!
> 
> Jarcec
> 
> On Mon, May 07, 2012 at 06:55:49PM -0700, Mike Percy wrote:
> > Hi folks,
> > Will McQueen and I have been doing some Flume NG stress and performance testing, and we wanted to share some of our recent findings. The focus of the most recent tests has been on the syslog TCP source, memory channel, and HDFS sink.
> >
> > I wrote some software to generate load in syslog format over TCP and to automate some of the analysis. The first thing we wanted to verify is that no data was lost during these tests (a.k.a. correctness), with a close second priority being, of course, throughput (performance). I used Pig and AvroStorage from piggybank in the data integrity analysis, and committed the compiled (0.11 trunk) piggybank jar so the load analysis scripts would be relatively easy to use. It seems to be compatible with Pig 0.8.1. I am a little wary of having to maintain that type of thing at the Apache org level, so for now I have checked all the code in on GitHub under an ASL 2.0 license:
> >
> > https://github.com/mpercy/flume-load-gen
> >
> > I have created a Wiki page with the performance metrics we have come up with so far. The executive summary is that at the time of this writing, we have observed Flume NG on a single machine processing events at a throughput rate of 70,000+ events/sec with no data loss.
> >
> > https://cwiki.apache.org/confluence/display/FLUME/Flume+NG+Performance+Measurements
> >
> > I have put more details on the wiki page itself. Please let me know if you want me to add more detail. I'll be looking into improving the performance of these components going forward; however, we wanted to post these results to set a public performance baseline for Flume NG.
> >
> > If others have done performance testing, we would love to see your results if you can post the details.
> >
> > Regards,
> > Mike
> >
> 
> 

