From: Raymond Ng
To: user@flume.apache.org
Date: Mon, 16 Jul 2012 07:36:58 +0100
Subject: Re: performance on RecoverableMemoryChannel vs JdbcChannel
In-Reply-To: <4FFF7B05.5030700@cyberagent.co.jp>

Thanks for the advice. I've summarised the points as follows:

1) Use of FileChannel

  - According to the 1.x User Guide this is not fully implemented yet.
  - Will this provide recoverability, and when will it be available?

2) Batch-capable source

  - The one that stands out immediately is AvroSource, but it needs
    something like an AvroSink to provide the batching, and this doesn't
    work with syslog in my scenario.
  - A "middle-man" component (similar to AvroCLIClient) would need to be
    in place to bridge the gap between syslog and AvroSource. It would
    need to constantly "tail" new lines and generate dynamic headers such
    as the timestamp, which is what the Syslog source is capable of.

3) Should I raise a ticket regarding batch capability on event-driven
   sources?

thanks
Ray

On Fri, Jul 13, 2012 at 2:33 AM, Juhani Connolly <juhani_connolly@cyberagent.co.jp> wrote:

> It's the SyslogSource... Since it's an event-driven source, it just sends
> single Events in commits.
>
> Raymond: if possible, try using a source where batching of events is
> possible. We're going to need to figure out some way to make this possible
> for event-driven sources, but at the moment this isn't the case
> unfortunately.
>
> On 07/13/2012 12:46 AM, Brock Noland wrote:
>
>> Hi,
>>
>> I would use FileChannel as opposed to RecoverableMemoryChannel.
>>
>> Also, it sounds like you're not batching somewhere, since without
>> batching you will see a disk seek per event. 1000 ms / 100 events = 10 ms
>> per event (about a disk seek).
>>
>> Brock
>>
>> On Thu, Jul 12, 2012 at 3:55 PM, Raymond Ng wrote:
>>
>>> Hi
>>>
>>> I'm investigating whether I can use Flume for streaming syslog data in
>>> a production environment, and which channel will give me both
>>> durability and performance.
>>>
>>> I've tested using the memory channel and the performance is good (i.e.
>>> with a 1GB JVM, achieving 9000 events / sec, with 1 agent with a syslog
>>> source hopping to another agent which has an HDFS sink).
>>>
>>> However, durability and recoverability are also important for a
>>> production solution, and both the Jdbc and RecoverableMemory channels
>>> seem to offer significantly slower performance (no more than 100
>>> events / sec). Also, the RecoverableMemory channel doesn't seem to
>>> resume streaming after the agents are restarted.
>>>
>>> Below are my agent configs. Could you advise how I can improve the
>>> performance of both the Jdbc and RecoverableMemory channels? Is it
>>> possible to configure them to achieve half the performance figure that
>>> the memory channel can achieve?
>>>
>>> Agent with Syslog source
>>>
>>> agent.sources = SysLogSrc
>>> #agent.channels = MemChannel
>>> #agent.channels = JdbcChannel
>>> agent.channels = RecovMemChannel
>>> agent.sinks = AvroSink
>>>
>>> # SysLogSrc
>>> agent.sources.SysLogSrc.type = syslogtcp
>>> agent.sources.SysLogSrc.host = localhost
>>> agent.sources.SysLogSrc.port = 10902
>>> #agent.sources.SysLogSrc.channels = MemChannel
>>> #agent.sources.SysLogSrc.channels = JdbcChannel
>>> agent.sources.SysLogSrc.channels = RecovMemChannel
>>> # MemChannel
>>> agent.channels.MemChannel.type = memory
>>> agent.channels.MemChannel.capacity = 1000000
>>> agent.channels.MemChannel.transactionCapacity = 10000
>>> agent.channels.MemChannel.keep-alive = 3
>>> # JdbcChannel
>>> agent.channels.JdbcChannel.type = jdbc
>>> agent.channels.JdbcChannel.db.type = DERBY
>>> agent.channels.JdbcChannel.driver.class = org.apache.derby.jdbc.EmbeddedDriver
>>> agent.channels.JdbcChannel.create.schema = true
>>> agent.channels.JdbcChannel.create.index = true
>>> agent.channels.JdbcChannel.create.foreignkey = true
>>> agent.channels.JdbcChannel.maximum.connections = 10
>>> agent.channels.JdbcChannel.maximum.capacity = 0
>>> agent.channels.JdbcChannel.sysprop.user.home = /flume/data
>>> # RecovMemChannel
>>> agent.channels.RecovMemChannel.type = org.apache.flume.channel.recoverable.memory.RecoverableMemoryChannel
>>> agent.channels.RecovMemChannel.wal.dataDir = /flume/recoverable-memory-channel
>>> agent.channels.RecovMemChannel.wal.rollSize = 104857600
>>> agent.channels.RecovMemChannel.wal.minRetentionPeriod = 3600000
>>> agent.channels.RecovMemChannel.wal.workerInterval = 5000
>>> agent.channels.RecovMemChannel.wal.maxLogsSize = 1073741824
>>> agent.channels.RecovMemChannel.capacity = 1000000
>>> agent.channels.RecovMemChannel.transactionCapacity = 10000
>>> agent.channels.RecovMemChannel.keep-alive = 3
>>>
>>> # AvroSink
>>> agent.sinks.AvroSink.type = avro
>>> agent.sinks.AvroSink.hostname = 192.168.200.170
>>> agent.sinks.AvroSink.port = 10900
>>> agent.sinks.AvroSink.batch-size = 10000
>>> #agent.sinks.AvroSink.channel = JdbcChannel
>>> #agent.sinks.AvroSink.channel = MemChannel
>>> agent.sinks.AvroSink.channel = RecovMemChannel
>>>
>>>
>>> Agent with HDFS sink
>>>
>>> agent.sources = AvroSrc
>>> #agent.channels = MemChannel
>>> #agent.channels = JdbcChannel
>>> agent.channels = RecovMemChannel
>>> agent.sinks = HdfsSink
>>> # AvroSrc
>>> agent.sources.AvroSrc.type = avro
>>> agent.sources.AvroSrc.bind = 192.168.200.170
>>> agent.sources.AvroSrc.port = 10900
>>> agent.sources.AvroSrc.channels = RecovMemChannel
>>> #agent.sources.AvroSrc.channels = JdbcChannel
>>> #agent.sources.AvroSrc.channels = MemChannel
>>> # MemChannel
>>> agent.channels.MemChannel.type = memory
>>> agent.channels.MemChannel.capacity = 1000000
>>> agent.channels.MemChannel.transactionCapacity = 10000
>>> agent.channels.MemChannel.keep-alive = 3
>>> # JdbcChannel
>>> agent.channels.JdbcChannel.type = jdbc
>>> agent.channels.JdbcChannel.db.type = DERBY
>>> agent.channels.JdbcChannel.driver.class = org.apache.derby.jdbc.EmbeddedDriver
>>> agent.channels.JdbcChannel.create.schema = true
>>> agent.channels.JdbcChannel.create.index = true
>>> agent.channels.JdbcChannel.create.foreignkey = true
>>> agent.channels.JdbcChannel.maximum.connections = 10
>>> agent.channels.JdbcChannel.maximum.capacity = 0
>>> agent.channels.JdbcChannel.sysprop.user.home = /flume/data
>>> # RecovMemChannel
>>> agent.channels.RecovMemChannel.type = org.apache.flume.channel.recoverable.memory.RecoverableMemoryChannel
>>> agent.channels.RecovMemChannel.wal.dataDir = /flume/recoverable-memory-channel
>>> agent.channels.RecovMemChannel.wal.rollSize = 104857600
>>> agent.channels.RecovMemChannel.wal.minRetentionPeriod = 3600000
>>> agent.channels.RecovMemChannel.wal.workerInterval = 5000
>>> agent.channels.RecovMemChannel.wal.maxLogsSize = 1073741824
>>> agent.channels.RecovMemChannel.capacity = 1000000
>>> agent.channels.RecovMemChannel.transactionCapacity = 10000
>>> agent.channels.RecovMemChannel.keep-alive = 3
>>> # HdfsSink
>>> agent.sinks.HdfsSink.type = hdfs
>>> agent.sinks.HdfsSink.hdfs.path = hdfs://master:50070/data/flume
>>> agent.sinks.HdfsSink.hdfs.filePrefix = data_%Y%m%d
>>> #agent.sinks.HdfsSink.channel = MemChannel
>>> #agent.sinks.HdfsSink.channel = JdbcChannel
>>> agent.sinks.HdfsSink.channel = RecovMemChannel
>>> agent.sinks.HdfsSink.hdfs.rollInterval = 300
>>> agent.sinks.HdfsSink.hdfs.rollSize = 209715200
>>> agent.sinks.HdfsSink.hdfs.rollCount = 0
>>> agent.sinks.HdfsSink.hdfs.batchSize = 1000
>>> agent.sinks.HdfsSink.hdfs.writeFormat = Text
>>> agent.sinks.HdfsSink.hdfs.fileType = DataStream
>>>
>>> --
>>> Rgds
>>> Ray

--
Rgds
Ray
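
Brock's estimate quoted above can be checked with a few lines of arithmetic. The sketch below is a simplified model, not a measurement: it assumes every commit costs exactly one disk seek of about 10 ms and ignores all other costs (network, serialization, sync behaviour). The function name and the batch sizes are illustrative only.

```python
def events_per_second(batch_size: int, seek_ms: float = 10.0) -> float:
    """Upper bound on throughput if every commit costs one disk seek.

    Assumption: one commit == one seek of `seek_ms` milliseconds, and each
    commit carries `batch_size` events. Real channels have other costs too.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    commits_per_second = 1000.0 / seek_ms
    return commits_per_second * batch_size


if __name__ == "__main__":
    for size in (1, 10, 100):
        print(f"batch of {size:>3}: ~{events_per_second(size):,.0f} events/sec")
```

Under these assumptions a batch size of 1 is capped at about 100 events/sec, which matches the figure reported for the Jdbc and RecoverableMemory channels, while a batch of 100 raises the same bound to about 10,000 events/sec.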
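
For point 2, the batching half of such a "middle-man" could look roughly like the sketch below. It is only an illustration of the idea and not an existing Flume component: `LineBatcher`, the `(headers, body)` event shape, the `timestamp` header name, and the `send` callback (which stands in for whatever client would deliver a batch to an AvroSource) are all assumptions made for this example.

```python
import time
from typing import Callable, Dict, List, Tuple

# Hypothetical event shape for this sketch: (headers, body).
Event = Tuple[Dict[str, str], bytes]


class LineBatcher:
    """Collects lines and hands them to `send` in batches of `batch_size`."""

    def __init__(
        self,
        send: Callable[[List[Event]], None],
        batch_size: int = 100,
        clock: Callable[[], float] = time.time,
    ) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._send = send              # placeholder for a real batch sender
        self._batch_size = batch_size
        self._clock = clock            # injectable so the sketch is testable
        self._pending: List[Event] = []

    def add(self, line: str) -> None:
        """Wrap one line as an event with a millisecond timestamp header."""
        headers = {"timestamp": str(int(self._clock() * 1000))}
        self._pending.append((headers, line.encode("utf-8")))
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Send whatever is pending as one batch (no-op when empty)."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._send(batch)


if __name__ == "__main__":
    sent: List[List[Event]] = []
    batcher = LineBatcher(sent.append, batch_size=3)
    for n in range(7):
        batcher.add(f"line {n}")
    batcher.flush()  # push the final partial batch
    print([len(batch) for batch in sent])
```

Seven lines with a batch size of 3 end up as batches of 3, 3 and 1. A real component would probably also need a timed flush so that a slow trickle of lines is not held back indefinitely, and it would still need the part that tails the syslog output.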