flume-user mailing list archives

From Raymond Ng <raymond...@gmail.com>
Subject Re: performance on RecoverableMemoryChannel vs JdbcChannel
Date Mon, 16 Jul 2012 06:36:58 GMT
Thanks for the advice. I've summarised the points as follows:

1) use of FileChannel

  - according to the User Guide 1.x this is not fully implemented yet,
  - will this provide recoverability and when will this be available?

2) batch capable source

  - the one that stands out immediately is AvroSource, but it needs
something like an AvroSink upstream to provide the batching, and this
doesn't work with syslog in my scenario.

A "middle-man" component (similar to AvroCLIClient) would need to be in
place to bridge the gap between syslog and AvroSource. It would need to
constantly "tail" new lines and generate dynamic headers such as a
timestamp, which is what the Syslog source is capable of.

3) should I raise a ticket regarding batch capability for event-driven
sources?
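
To make the "middle-man" idea in point 2 concrete, here is a rough Python sketch of only the batching and header-stamping part. It is a sketch under assumptions, not a working bridge: the function name batch_events, the (headers, body) tuple layout and the "timestamp" header in epoch milliseconds are my own illustrative choices, and both the tailing of a live file and the actual Avro transport to AvroSource are left out.

```python
import io
import time
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical event shape for this sketch: (headers, body).
Event = Tuple[Dict[str, str], str]


def batch_events(lines: Iterable[str], batch_size: int,
                 clock: Callable[[], float] = time.time) -> Iterator[List[Event]]:
    """Group incoming lines into batches, stamping each with a timestamp header."""
    batch: List[Event] = []
    for line in lines:
        body = line.rstrip("\n")
        if not body:
            continue  # skip blank lines
        # Assumed header convention: epoch milliseconds as a string.
        headers = {"timestamp": str(int(clock() * 1000))}
        batch.append((headers, body))
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch


if __name__ == "__main__":
    sample = io.StringIO("line one\nline two\nline three\n")
    for batch in batch_events(sample, batch_size=2):
        # A real bridge would hand each batch to the Avro source in one call here.
        print(len(batch), [body for _, body in batch])
```

The point of the sketch is only that the source would then commit one transaction per batch instead of one per event. A real bridge would also want to flush partial batches on a timer so that quiet periods don't delay delivery.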

thanks
Ray


On Fri, Jul 13, 2012 at 2:33 AM, Juhani Connolly <
juhani_connolly@cyberagent.co.jp> wrote:

> It's the SyslogSource... Since it's an event driven source, it just sends
> single Events in commits.
>
> Raymond: if possible, try using a source where batching of events is
> possible. We're going to need to figure out some way to make this possible
> for event driven sources, but at the moment this isn't the case
> unfortunately.
>
>
> On 07/13/2012 12:46 AM, Brock Noland wrote:
>
>> Hi,
>>
>> I would use FileChannel as opposed to RecoverableMemoryChannel.
>>
>> Also, it sounds like you're not batching somewhere, since without batching
>> you will see a disk seek per event. 1000 ms / 100 events = 10 ms
>> (about a disk seek).
>>
>> Brock
>>
>> On Thu, Jul 12, 2012 at 3:55 PM, Raymond Ng <raymondair@gmail.com> wrote:
>>
>>> Hi
>>>
>>> I'm trying to investigate whether I can use flume for streaming syslog
>>> data in a production environment, and investigating which channel will
>>> give me durability and also performance
>>>
>>> I've tested using memory channel and the performance is good (i.e. with a
>>> 1GB JVM, achieving 9000 events / sec, with 1 agent with a syslog source
>>> hopping to another agent which has a hdfs sink)
>>>
>>> however durability and recoverability are also important when it comes to
>>> a production solution, and it seems both Jdbc and RecoverableMemory
>>> channels offer significantly slower performance (no more than 100 events
>>> / sec). Also, the RecoverableMemory channel doesn't seem to resume
>>> streaming after the agents were restarted
>>>
>>> below are my agent configs, could you advise how I can improve the
>>> performance for both jdbc and recoverableMemory channels? Is it possible
>>> to configure them to achieve half the performance figure that the memory
>>> channel can achieve?
>>>
>>> Agent with Syslog source
>>>
>>> agent.sources = SysLogSrc
>>> #agent.channels = MemChannel
>>> #agent.channels = JdbcChannel
>>> agent.channels = RecovMemChannel
>>> agent.sinks = AvroSink
>>>
>>> # SysLogSrc
>>> agent.sources.SysLogSrc.type = syslogtcp
>>> agent.sources.SysLogSrc.host = localhost
>>> agent.sources.SysLogSrc.port = 10902
>>> #agent.sources.SysLogSrc.channels = MemChannel
>>> #agent.sources.SysLogSrc.channels = JdbcChannel
>>> agent.sources.SysLogSrc.channels = RecovMemChannel
>>> # MemChannel
>>> agent.channels.MemChannel.type = memory
>>> agent.channels.MemChannel.capacity = 1000000
>>> agent.channels.MemChannel.transactionCapacity = 10000
>>> agent.channels.MemChannel.keep-alive = 3
>>> # JdbcChannel
>>> agent.channels.JdbcChannel.type = jdbc
>>> agent.channels.JdbcChannel.db.type = DERBY
>>> agent.channels.JdbcChannel.driver.class = org.apache.derby.jdbc.EmbeddedDriver
>>> agent.channels.JdbcChannel.create.schema = true
>>> agent.channels.JdbcChannel.create.index = true
>>> agent.channels.JdbcChannel.create.foreignkey = true
>>> agent.channels.JdbcChannel.maximum.connections = 10
>>> agent.channels.JdbcChannel.maximum.capacity = 0
>>> agent.channels.JdbcChannel.sysprop.user.home = /flume/data
>>> # RecovMemChannel
>>> agent.channels.RecovMemChannel.type = org.apache.flume.channel.recoverable.memory.RecoverableMemoryChannel
>>> agent.channels.RecovMemChannel.wal.dataDir = /flume/recoverable-memory-channel
>>> agent.channels.RecovMemChannel.wal.rollSize = 104857600
>>> agent.channels.RecovMemChannel.wal.minRetentionPeriod = 3600000
>>> agent.channels.RecovMemChannel.wal.workerInterval = 5000
>>> agent.channels.RecovMemChannel.wal.maxLogsSize = 1073741824
>>> agent.channels.RecovMemChannel.capacity = 1000000
>>> agent.channels.RecovMemChannel.transactionCapacity = 10000
>>> agent.channels.RecovMemChannel.keep-alive = 3
>>>
>>> # AvroSink
>>> agent.sinks.AvroSink.type = avro
>>> agent.sinks.AvroSink.hostname = 192.168.200.170
>>> agent.sinks.AvroSink.port = 10900
>>> agent.sinks.AvroSink.batch-size = 10000
>>> #agent.sinks.AvroSink.channel = JdbcChannel
>>> #agent.sinks.AvroSink.channel = MemChannel
>>> agent.sinks.AvroSink.channel = RecovMemChannel
>>>
>>>
>>> Agent with HDFS sink
>>>
>>> agent.sources = AvroSrc
>>> #agent.channels = MemChannel
>>> #agent.channels = JdbcChannel
>>> agent.channels = RecovMemChannel
>>> agent.sinks = HdfsSink
>>> # AvroSrc
>>> agent.sources.AvroSrc.type = avro
>>> agent.sources.AvroSrc.bind = 192.168.200.170
>>> agent.sources.AvroSrc.port = 10900
>>> agent.sources.AvroSrc.channels = RecovMemChannel
>>> #agent.sources.AvroSrc.channels = JdbcChannel
>>> #agent.sources.AvroSrc.channels = MemChannel
>>> # MemChannel
>>> agent.channels.MemChannel.type = memory
>>> agent.channels.MemChannel.capacity = 1000000
>>> agent.channels.MemChannel.transactionCapacity = 10000
>>> agent.channels.MemChannel.keep-alive = 3
>>> # JdbcChannel
>>> agent.channels.JdbcChannel.type = jdbc
>>> agent.channels.JdbcChannel.db.type = DERBY
>>> agent.channels.JdbcChannel.driver.class = org.apache.derby.jdbc.EmbeddedDriver
>>> agent.channels.JdbcChannel.create.schema = true
>>> agent.channels.JdbcChannel.create.index = true
>>> agent.channels.JdbcChannel.create.foreignkey = true
>>> agent.channels.JdbcChannel.maximum.connections = 10
>>> agent.channels.JdbcChannel.maximum.capacity = 0
>>> agent.channels.JdbcChannel.sysprop.user.home = /flume/data
>>> # RecovMemChannel
>>> agent.channels.RecovMemChannel.type = org.apache.flume.channel.recoverable.memory.RecoverableMemoryChannel
>>> agent.channels.RecovMemChannel.wal.dataDir = /flume/recoverable-memory-channel
>>> agent.channels.RecovMemChannel.wal.rollSize = 104857600
>>> agent.channels.RecovMemChannel.wal.minRetentionPeriod = 3600000
>>> agent.channels.RecovMemChannel.wal.workerInterval = 5000
>>> agent.channels.RecovMemChannel.wal.maxLogsSize = 1073741824
>>> agent.channels.RecovMemChannel.capacity = 1000000
>>> agent.channels.RecovMemChannel.transactionCapacity = 10000
>>> agent.channels.RecovMemChannel.keep-alive = 3
>>> # HdfsSink
>>> agent.sinks.HdfsSink.type = hdfs
>>> agent.sinks.HdfsSink.hdfs.path = hdfs://master:50070/data/flume
>>> agent.sinks.HdfsSink.hdfs.filePrefix = data_%Y%m%d
>>> #agent.sinks.HdfsSink.channel = MemChannel
>>> #agent.sinks.HdfsSink.channel = JdbcChannel
>>> agent.sinks.HdfsSink.channel = RecovMemChannel
>>> agent.sinks.HdfsSink.hdfs.rollInterval = 300
>>> agent.sinks.HdfsSink.hdfs.rollSize = 209715200
>>> agent.sinks.HdfsSink.hdfs.rollCount = 0
>>> agent.sinks.HdfsSink.hdfs.batchSize = 1000
>>> agent.sinks.HdfsSink.hdfs.writeFormat = Text
>>> agent.sinks.HdfsSink.hdfs.fileType = DataStream
>>>
>>> --
>>> Rgds
>>> Ray
>>>
>>
>>
>>
>
>


-- 
Rgds
Ray
