flume-user mailing list archives

From Ahmed Vila <av...@devlogic.eu>
Subject Re: Flume 1.4 High CPU
Date Thu, 16 Oct 2014 08:26:14 GMT
Hi Mike,

There seems to be something wrong with your server, as 10 of the 17 CPU threads
you have are running near 100% system time. The other seven are spending much
of their time in iowait.
A blog post like this one is valuable when dealing with high sys usage, as it
helps you determine the cause:
http://newspaint.wordpress.com/2013/07/24/how-to-diagnose-high-sys-cpu-on-linux/
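
For a quick first pass, you can break CPU time down per process and sample
where the kernel is spending its time. Just a sketch - it assumes the sysstat
and perf packages are installed, and <flume-pid> is a placeholder for your
actual Flume process id:

  # per-process CPU breakdown (%usr vs %system), refreshed every second
  pidstat -u 1

  # live sampling of hot kernel/user code paths, with call graphs
  perf top -g

  # kernel stack of a stuck thread (run as root)
  cat /proc/<flume-pid>/stack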

Since some of the threads are high on iowait, I would presume the others are
just stuck in read or write system calls, which would explain the high sys
usage.
That narrows the problem down to disk I/O throughput.

Of course, that might simply be caused by a high number of Flume events - every
system can be overwhelmed.

So, could you run "iostat 1" in a separate terminal and then run your test
again? From the iostat output we can then verify whether the disks are slower
than they should be - you should expect at least 50 IOPS from a 15k disk.
Given the high iowait, a system that is operating normally should reach numbers
that high; if it does, your hardware is just overwhelmed rather than
malfunctioning.
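
The extended view makes the relevant columns easier to read - for example
(device names will differ on your box):

  # extended per-device statistics, refreshed every second
  iostat -x 1

  # columns worth watching:
  #   r/s, w/s - read/write IOPS
  #   await    - average I/O latency in milliseconds
  #   %util    - device saturation (pegged near 100% = bottleneck)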

If you're using hardware RAID, please let us know in what configuration, as
you could achieve 50 IOPS even with a malfunctioning RAID controller and yet
still have disks that are not utilized up to their maximum throughput capacity.
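
If you're not sure what controller is in the box, something like this usually
identifies it; the vendor's own tool (MegaCli, hpacucli, etc., depending on
the controller) would then show cache and battery status:

  # identify the RAID controller, if any
  lspci | grep -i raid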

Anyway, switching over to a memory channel will probably bring immediate
benefits, and it would also tell you whether the file channel is the one
responsible.
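
For a quick test, a sketch along these lines should do - it reuses your agent
and channel names, and the capacity value is a placeholder you'd tune to your
event rate and heap size:

  nontx_host07_agent01.channels.fc.type = memory
  # events held in the JVM heap - placeholder, size it to your heap
  nontx_host07_agent01.channels.fc.capacity = 1000000
  # must stay >= the sinks' hdfs.batchSize (50000 in your config)
  nontx_host07_agent01.channels.fc.transactionCapacity = 240000

Keep in mind a memory channel loses its events on a crash or restart, so this
is for isolating the bottleneck, not a durability replacement.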


Regards,
Ahmed


On Wed, Oct 15, 2014 at 11:51 PM, Mike Zupan <mike.zupan@manage.com> wrote:

> Ahmed,
>
> I’m pretty new to hadoop, so I’m trying my best to debug this, and I can’t
> pull the events yet.
>
> We are on 15k disks across the board, but your uncompress-then-compress point
> led me to what I think is the right track. I’m going to try sending to the
> flume servers uncompressed and see if that helps. We are getting a lot of
> CPU wait when new files come in.
>
> For example
>
> Cpu0  : 14.8%us, 15.1%sy,  0.0%ni,  2.0%id, 65.1%wa,  0.0%hi,  3.0%si,  0.0%st
> Cpu1  :  4.0%us, 39.7%sy,  0.0%ni, 34.0%id, 22.3%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu2  :  0.3%us, 97.4%sy,  0.0%ni,  2.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
> Cpu3  :  2.3%us, 75.5%sy,  0.0%ni, 13.2%id,  8.9%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu4  :  1.3%us, 51.8%sy,  0.0%ni, 30.9%id, 15.9%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu5  :  0.0%us,100.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu6  :  0.0%us, 99.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
> Cpu7  :  4.0%us, 40.7%sy,  0.0%ni, 41.7%id, 13.6%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu8  :  0.3%us, 99.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
> Cpu9  :  0.0%us,100.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu10 :  0.0%us, 99.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
> Cpu11 :  2.0%us, 72.0%sy,  0.0%ni,  4.0%id, 22.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu12 :  5.0%us, 33.3%sy,  0.0%ni, 26.3%id, 35.3%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu13 :  0.0%us,100.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu14 :  0.0%us,100.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu15 :  0.0%us, 99.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
> Cpu16 :  0.0%us,100.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>
> Thanks
>
> --
> Mike Zupan
>
> On Wednesday, October 15, 2014 at 2:27 PM, Ahmed Vila wrote:
>
> Hi Mike,
>
> It would be really helpful if you could provide the number of events entering
> the source.
>
> Also, please provide the CPU utilization from top - the line that breaks down
> utilization by user/system/iowait/idle.
> If it shows high iowait, it might be that the channel is using more I/O than
> your storage can handle - especially if it's an NFS or iSCSI mount.
> But the biggest factor is the number of events.
>
> I see that you actually un-compress the events on arrival at the source and
> compress them back at the sink.
> It's well known that compression/decompression is above all a CPU-bound task.
> That might be the problem, and it can reduce Flume throughput greatly,
> especially because you have 4 sinks, each doing compression on its own.
>
> Regards,
> Ahmed Vila
>
> On Wed, Oct 15, 2014 at 5:32 PM, Mike Zupan <mike.zupan@manage.com> wrote:
>
> I’m seeing issues with the flume server using very high amounts of CPU. Just
> wondering if this is a common issue with a file channel. I’m pretty new to
> flume, so sorry if this isn’t enough to debug the issue.
>
> Current top looks like
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  8509 root      20   0 22.0g 8.6g 675m S 1109.4 13.7   1682:45 java
>  8251 root      20   0 21.9g 8.3g 647m S 1083.5 13.2   1476:27 java
>  7593 root      20   0 12.4g 8.4g  18m S 1007.5 13.4   1866:18 java
>
> As you can see, 3 of our 4 flume servers are using over 1000% CPU.
>
> Details are
>
> OS: CentOS 6.5
> Java: Oracle "1.7.0_45"
> Flume: flume-1.4.0.2.1.1.0-385.el6.noarch
>
> Our config for the server looks like this
>
> ###############################################
> # Agent configuration for transactional data
> ###############################################
> nontx_host07_agent01.sources = avro
> nontx_host07_agent01.channels = fc
> nontx_host07_agent01.sinks = hdfs_sink_01 hdfs_sink_02 hdfs_sink_03 hdfs_sink_04
>
> ##################################################
> # info is published to port 9991
> ##################################################
> nontx_host07_agent01.sources.avro.type = avro
> nontx_host07_agent01.sources.avro.bind = 0.0.0.0
> nontx_host07_agent01.sources.avro.port = 9991
> nontx_host07_agent01.sources.avro.threads = 100
> nontx_host07_agent01.sources.avro.compression-type = deflate
> nontx_host07_agent01.sources.avro.interceptors = ts id
> nontx_host07_agent01.sources.avro.interceptors.ts.type = timestamp
> nontx_host07_agent01.sources.avro.interceptors.ts.preserveExisting = false
> nontx_host07_agent01.sources.avro.interceptors.id.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
> nontx_host07_agent01.sources.avro.interceptors.id.preserveExisting = true
>
>
> ##################################################
> # The Channels
> ##################################################
> nontx_host07_agent01.channels.fc.type = file
> nontx_host07_agent01.channels.fc.checkpointDir = /flume/channels/checkpoint/nontx_host07_agent01
> nontx_host07_agent01.channels.fc.dataDirs = /flume/channels/data/nontx_host07_agent01
> nontx_host07_agent01.channels.fc.capacity = 140000000
> nontx_host07_agent01.channels.fc.transactionCapacity = 240000
>
> ##################################################
> # Sinks
> ##################################################
> nontx_host07_agent01.sinks.hdfs_sink_01.type = hdfs
> nontx_host07_agent01.sinks.hdfs_sink_01.hdfs.path = hdfs://cluster01:8020/flume/%{log_type}
> nontx_host07_agent01.sinks.hdfs_sink_01.hdfs.filePrefix = flume_nontx_host07_agent01_sink01_%Y%m%d%H
> nontx_host07_agent01.sinks.hdfs_sink_01.hdfs.inUsePrefix=_
> nontx_host07_agent01.sinks.hdfs_sink_01.hdfs.inUseSuffix=.tmp
> nontx_host07_agent01.sinks.hdfs_sink_01.hdfs.fileType = CompressedStream
> nontx_host07_agent01.sinks.hdfs_sink_01.hdfs.codeC = snappy
> nontx_host07_agent01.sinks.hdfs_sink_01.hdfs.rollSize = 0
> nontx_host07_agent01.sinks.hdfs_sink_01.hdfs.rollCount = 0
> nontx_host07_agent01.sinks.hdfs_sink_01.hdfs.rollInterval = 300
> nontx_host07_agent01.sinks.hdfs_sink_01.hdfs.idleTimeout = 30
> nontx_host07_agent01.sinks.hdfs_sink_01.hdfs.timeZone = America/Los_Angeles
> nontx_host07_agent01.sinks.hdfs_sink_01.hdfs.callTimeout = 30000
> nontx_host07_agent01.sinks.hdfs_sink_01.hdfs.batchSize = 50000
> nontx_host07_agent01.sinks.hdfs_sink_01.hdfs.round = true
> nontx_host07_agent01.sinks.hdfs_sink_01.hdfs.roundUnit = minute
> nontx_host07_agent01.sinks.hdfs_sink_01.hdfs.roundValue = 5
> nontx_host07_agent01.sinks.hdfs_sink_01.hdfs.threadsPoolSize = 2
> nontx_host07_agent01.sinks.hdfs_sink_01.serializer = com.manage.flume.serialization.HeaderAndBodyJsonEventSerializer$Builder
>
> --
> Mike Zupan

