Subject: Re: Can Flume handle +100k events per second?
From: Roshan Naik <roshan@hortonworks.com>
To: user@flume.apache.org
Date: Wed, 6 Nov 2013 11:42:27 -0800

A single HDFS sink should be able to write to multiple files if the events are annotated with a 'host' header and the %{host} escape sequence is used in the hdfs.path config. Depending on the host name value in each event's header, the sink will write the event to the host-specific file.
You can have all the events coming in from a single server annotated with the hostname of that server.
I am not sure if there is a way to ensure that each file on the source ends up as a separate file in HDFS.
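
For example, something along these lines, where the component names and path layout are purely illustrative:

  # Hypothetical agent config (names and paths are placeholders).
  # A host interceptor stamps each event with a 'host' header:
  agent1.sources.r1.interceptors = i1
  agent1.sources.r1.interceptors.i1.type = host
  agent1.sources.r1.interceptors.i1.useIP = false
  # The sink expands %{host} from that header, so each server's
  # events land in their own directory:
  agent1.sinks.k1.type = hdfs
  agent1.sinks.k1.hdfs.path = hdfs://namenode/logs/%{host}
  agent1.sinks.k1.hdfs.filePrefix = logstat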


On Wed, Nov 6, 2013 at 1:39 AM, Bojan Kostić <blood9raven@gmail.com> wrote:

It was late when I wrote the last mail, and my explanation was not clear.
I will illustrate:
20 servers, each with 60 different log files.
I was thinking that I could have this kind of structure on HDFS:
/logs/server0/logstat0.log
/logs/server0/logstat1.log
.
.
.
/logs/server20/logstat0.log
.
.
.

But from your info I see that I can't do that.
I could try to add a server id column in every file and then aggregate the files from all servers into one file:
/logs/logstat0.log
/logs/logstat1.log
.
.
.

But again, I would need 60 sinks.

On Nov 6, 2013 2:02 AM, "Roshan Naik" <roshan@hortonworks.com> wrote:
I assume you mean you have 120 source files to be streamed into HDFS.
There is not a 1-1 correspondence between source files and destination HDFS files. If they are on the same host, you can have them all picked up through one source, one channel and one HDFS sink... winding up in a single HDFS file.

In case you have a config with multiple HDFS sinks (part of a single agent or spanning multiple agents), you want to ensure each HDFS sink writes to a separate file in HDFS.
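
For instance, with two HDFS sinks in a single agent, giving each one its own file prefix (or path) keeps their output separate; the names below are placeholders:

  agent1.sinks = k1 k2
  agent1.sinks.k1.type = hdfs
  agent1.sinks.k1.hdfs.path = hdfs://namenode/logs
  agent1.sinks.k1.hdfs.filePrefix = sink1
  agent1.sinks.k2.type = hdfs
  agent1.sinks.k2.hdfs.path = hdfs://namenode/logs
  agent1.sinks.k2.hdfs.filePrefix = sink2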


On Tue, Nov 5, 2013 at 4:41 PM, Bojan Kostić <blood9raven@gmail.com> wrote:

Hello Roshan,

Thanks for the response.
But I am now confused. If I have 120 files, do I need to configure 120 sinks/sources/channels separately? Or have I missed something in the docs?
Maybe I should use a fan-out flow? But then again I must set 120 params.

Best regards.

On Nov 5, 2013 8:47 PM, "Roshan Naik" <roshan@hortonworks.com> wrote:
Yes, to avoid them clobbering each other's writes.


On Tue, Nov 5, 2013 at 4:34 AM, Bojan Kostić <blood9raven@gmail.com> wrote:

Sorry for the late response, but I lost this email somehow.

Thanks for the read; it is a nice start even though it is old.
And the numbers are really promising.

I'm testing the memory channel; there are about 20 data sources (log servers) with 60 different files each.

My RPC client app is basic, like in the examples, but it has load balancing across two Flume agents which are writing data to HDFS.
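
For reference, the client is created from a properties file roughly like this (host names are placeholders, and I may be misremembering a property or two):

  client.type = default_loadbalance
  hosts = h1 h2
  hosts.h1 = agent1.example.com:41414
  hosts.h2 = agent2.example.com:41414
  host-selector = round_robin
  backoff = true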

I think I read somewhere that you should have one sink per file. Is that true?

Best regards, and sorry again for late response.

On Oct 22, 2013 8:50 AM, "Juhani Connolly" <juhani_connolly@cyberagent.co.jp> wrote:
Hi Bojan,

This is pretty old, but Mike did some testing on performance about a year and a half ago:

https://cwiki.apache.org/confluence/display/FLUME/Flume+NG+Syslog+Performance+Test+2012-04-30

He was getting a max of 70k events/sec on a single machine.

Thing is, this is a result of a huge number of variables:
- Parallelization of flows allows better parallel processing
- Use of the memory channel as opposed to a slower durable channel
- Possibly the source; I have no idea how you wrote your app
- Batching of events is important. Also, are all events written to one file, or are they split over many? Every file is separately processed.
- Network congestion, your HDFS setup

Reaching 100k events per second is definitely possible. The resources you need for it will vary significantly depending on your setup. The more HA-type features you use, the slower delivery is likely to become. On the flip side, allowing fairly lax conditions that have a small potential for data loss (on a crash, for example, memory channel contents are gone) will allow for close to 100k even on a single machine.
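
As a rough illustration of that lax-but-fast end of the spectrum, the relevant knobs look something like this (the numbers are made up to show the shape of the config, not recommendations):

  # Memory channel: fast, but its contents are lost on a crash.
  agent1.channels.c1.type = memory
  agent1.channels.c1.capacity = 100000
  agent1.channels.c1.transactionCapacity = 10000
  # Large batches amortize per-transaction overhead at the sink.
  agent1.sinks.k1.hdfs.batchSize = 10000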

On 10/14/2013 09:00 PM, Bojan Kostić wrote:
Hi, this is my first post here, but I have been playing with Flume for some time now. My question is: how well does Flume scale?
Can Flume ingest +100k events per second? Has anyone tried something like this?

I created a simple test and the results are really slow.
I wrote a simple app with an RPC client with fallback, using the Flume SDK, which is reading a dummy log file.
In the end I have two Flume agents which are writing to HDFS.
rollInterval = 60
And in HDFS I get files of ~12 MB.
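
For reference, my sink section looks roughly like this (names are placeholders and I am reconstructing it from memory, so treat it as approximate):

  agent1.sinks.k1.type = hdfs
  agent1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
  # Roll a new file every 60 seconds.
  agent1.sinks.k1.hdfs.rollInterval = 60
  # Size- and count-based rolling disabled, so only time triggers a roll.
  agent1.sinks.k1.hdfs.rollSize = 0
  agent1.sinks.k1.hdfs.rollCount = 0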

Do I need to use some complex 3-tier topology?
How many Flume agents should write to HDFS?

Best regards.


