flume-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From S Ahmed <sahmed1...@gmail.com>
Subject Re: Flume NG performance metrics
Date Thu, 10 May 2012 20:25:33 GMT
Great thanks.

Kafka takes advantage of nio and zero copy, it also seems to block during
the flush to disk (while Flume, correct me if I am wrong, is poping one
item at a time off the log collection).

I wonder if that is a big part of the performance differences (assuming
their are performance differences).

On Tue, May 8, 2012 at 7:16 PM, Mike Percy <mpercy@cloudera.com> wrote:

> Flume would likely do pretty well in that kind of scenario as well -- if
> you wanted to have a huge transaction size (2,000,000 events, I guess) as
> well as a very large capacity in the memory channel. You will also need to
> increase the timeout of the HDFS sink and disable serialization. At that
> point, you will likely be limited more by the speed of your Source than
> anything else.
> However, depending on your use case this kind of benchmark is somewhat
> academic, right? I guess it depends on what kind of reliability semantics
> you are looking for and what serialization and other support you want. If
> your OS crashes or reboots, then you've potentially lost millions of events
> in your memory channel if you are storing tons of events there. So
> performance and reliability will always be a tradeoff.
> Even so, I will be the first to point out that we could do much better.
> And we are working on improving the performance of the various components
> in the system.
> Mike
> On May 8, 2012, at 11:35 AM, S Ahmed wrote:
> I have played around with Kafka, and with a single producer (which I think
> in flume terms is a flow) I can write about 200 000 messages a second that
> are 200 bytes each.  What it does it it writes in memory for 10 seconds and
> flushes to disk. (this was on a dedicated box, with 4 gb of ram)
> This is just writing and not consuming (or sinking in flumes terms).
> Curious if there was a reason why flume does 70K/second.
> On Tue, May 8, 2012 at 1:36 PM, Mike Percy <mpercy@cloudera.com> wrote:
>> The "flow" terminology is something I could have defined better. I've
>> updated the config file fragment a bit in the example on the wiki to
>> hopefully clarify what was done:
>> https://cwiki.apache.org/confluence/display/FLUME/Flume+NG+Performance+Measurements#FlumeNGPerformanceMeasurements-SyslogPerformanceTest20120430
>> Generally when folks use the term "flow" as it relates to Flume it means
>> an independent Source, Channel, Sink combination, which do not interact
>> with any other Sources, Channels, or Sinks on the same agent. So each flow
>> in this case had its own SyslogTcpSource port, its own MemoryChannel, and
>> its own HDFSEventSink output path.
>> Mike
>> On May 8, 2012, at 9:38 AM, Arvind Prabhakar wrote:
>> Hi Ahmed,
>> >3.  6 flume agent servers that collected data in memory and flushed to
>> the 9 hadoop servers
>> All tests were carried out with a single Flume Agent (which corresponds
>> to one JVM process) on a single host. However, each measurement was made
>> with a different number of flows passing through this agent.
>> Thanks,
>> Arvind
>> On Tue, May 8, 2012 at 9:08 AM, S Ahmed <sahmed1020@gmail.com> wrote:
>>> Just want to make sure I understand the setup:
>>> 1. 9 hadoop servers that were fed the data
>>> 2. 1 server was used to generate the syslog data that was spread accross
>>> the 6 flume agent servers
>>> 3.  6 flume agent servers that collected data in memory and flushed to
>>> the 9 hadoop servers
>>> Is that right?
>>> On Tue, May 8, 2012 at 1:49 AM, Jarek Jarcec Cecho <jarcec@apache.org>wrote:
>>>> Thanks Mike,
>>>> this is in deed very helpful!
>>>> Jarcec
>>>> On Mon, May 07, 2012 at 06:55:49PM -0700, Mike Percy wrote:
>>>> > Hi folks,
>>>> > Will McQueen and I have been doing some Flume NG stress and
>>>> performance testing, and we wanted to share some of our recent findings.
>>>> The focus of the most recent tests has been on the syslog TCP source,
>>>> memory channel, and HDFS sink.
>>>> >
>>>> > I wrote some software to generate load in syslog format over TCP and
>>>> to automate some of the analysis. The first thing we wanted to verify is
>>>> that no data was lost during these tests (a.k.a. correctness), with a close
>>>> second priority being of course throughput (performance). I used Pig and
>>>> AvroStorage from piggybank in the data integrity analysis, and committed
>>>> the compiled (0.11 trunk) piggybank jar so the load analysis scripts would
>>>> be relatively easy to use. It seems to be compatible with Pig 0.8.1. I am
>>>> little wary of having to maintain that type of thing at the Apache org
>>>> level so for now I have checked all the code in on Github under an ASL 2.0
>>>> license:
>>>> >
>>>> > https://github.com/mpercy/flume-load-gen
>>>> >
>>>> > I have created a Wiki page with the performance metrics we have come
>>>> up with so far. The executive summary is that at the time of this writing,
>>>> we have observed Flume NG on a single machine processing events at a
>>>> throughput rate of 70,000+ events/sec with no data loss.
>>>> >
>>>> >
>>>> https://cwiki.apache.org/confluence/display/FLUME/Flume+NG+Performance+Measurements
>>>> >
>>>> > I have put more details on the wiki page itself. Please let me know
>>>> if you want me to add more detail. I'll be looking into improving the
>>>> performance of these components going forward, however we wanted to post
>>>> these results to set a public performance baseline of Flume NG.
>>>> >
>>>> > If others have done performance testing, we would love to see your
>>>> results if you can post the details.
>>>> >
>>>> > Regards,
>>>> > Mike
>>>> >

View raw message