flink-issues mailing list archives

From "Piotr Nowojski (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (FLINK-8707) Excessive amount of files opened by flink task manager
Date Thu, 22 Feb 2018 15:14:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-8707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372919#comment-16372919 ]

Piotr Nowojski edited comment on FLINK-8707 at 2/22/18 3:13 PM:
----------------------------------------------------------------

If you look at the attached lsof output, a little under 60% of the entries (8294 of 14674) are regular files:

 
{noformat}
cat box2-taskmgr-lsof | cut -c40-55 | sort -n | uniq -c
406       CHR       
116       DIR       
8294       REG       
3596      FIFO       
348      IPv6       
116      unix 0xffff
1798   a_inode
{noformat}
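The share of each type can be computed from the counts above rather than estimated. A minimal sketch, assuming the input is `uniq -c` style output (count first, type second); `type_shares` is an illustrative helper name, not something used in this issue:

```shell
# Read "COUNT TYPE ..." lines on stdin and print "TYPE COUNT SHARE%",
# largest count first. Only the first two fields of each line are used.
type_shares() {
  awk '{ count[$2] += $1; total += $1 }
       END {
         for (t in count)
           printf "%s %d %.1f%%\n", t, count[t], 100 * count[t] / total
       }' | sort -k2,2nr
}

# Possible usage with the attached file:
# cut -c40-55 box2-taskmgr-lsof | sort | uniq -c | type_shares
```

Fed the seven counts listed above, this reports REG at 56.5% and FIFO at 24.5% of the 14674 entries.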
and those files repeat themselves 116 times:

 
{noformat}
116 /opt/app/wily/agent/Agent.jar
116 /opt/app/wily/agent/core/ext/AppMap.jar
116 /opt/app/wily/agent/core/ext/BasicDirectiveLoader.jar
116 /opt/app/wily/agent/core/ext/BizDef.jar
116 /opt/app/wily/agent/core/ext/BizTrxHttp.jar
116 /opt/app/wily/agent/core/ext/ChangeDetector-Agent_Server.jar
116 /opt/app/wily/agent/core/ext/ChangeDetector-CommonAll.jar
116 /opt/app/wily/agent/core/ext/ChangeDetectorAgent.jar
116 /opt/app/wily/agent/core/ext/DynInstrBootstrap.jar
116 /opt/app/wily/agent/core/ext/DynInstrSupport15.jar
116 /opt/app/wily/agent/core/ext/GCMonitor.jar
116 /opt/app/wily/agent/core/ext/HPC-GcMonitorAgent.jar
116 /opt/app/wily/agent/core/ext/Inheritance.jar
116 /opt/app/wily/agent/core/ext/Java15DynamicInstrumentation.jar
116 /opt/app/wily/agent/core/ext/LeakHunter.jar
116 /opt/app/wily/agent/core/ext/ProbeBuilder.jar
116 /opt/app/wily/agent/core/ext/RegexNormalizerExtension.jar
116 /opt/app/wily/agent/core/ext/SQLAgent.jar
116 /opt/app/wily/agent/core/ext/ServletHeaderDecorator.jar
116 /opt/app/wily/agent/core/ext/ServletHelper.jar
116 /opt/app/wily/agent/core/ext/Supportability-Agent.jar
116 /opt/app/wily/agent/core/ext/ThreadDumpGen.jar
116 /opt/app/wily/agent/core/ext/TomcatMonitoring.jar
116 /opt/app/wily/agent/core/ext/WebAppSupport.jar
116 /opt/app/wily/agent/core/ext/introscopeAIXPSeries32Stats.jar
116 /opt/app/wily/agent/core/ext/introscopeAIXPSeries64Stats.jar
116 /opt/app/wily/agent/core/ext/introscopeHpuxItanium32Stats.jar
116 /opt/app/wily/agent/core/ext/introscopeHpuxItanium64Stats.jar
116 /opt/app/wily/agent/core/ext/introscopeHpuxParisc32Stats.jar
116 /opt/app/wily/agent/core/ext/introscopeHpuxParisc64Stats.jar
116 /opt/app/wily/agent/core/ext/introscopeLinuxIntelAmd32Stats.jar
116 /opt/app/wily/agent/core/ext/introscopeLinuxIntelAmd64Stats.jar
116 /opt/app/wily/agent/core/ext/introscopeSolarisAmd32Stats.jar
116 /opt/app/wily/agent/core/ext/introscopeSolarisAmd64Stats.jar
116 /opt/app/wily/agent/core/ext/introscopeSolarisSparc32Stats.jar
116 /opt/app/wily/agent/core/ext/introscopeSolarisSparc64Stats.jar
116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/flink-dist_2.11-1.3.2.jar
116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/flink-python_2.11-1.3.2.jar
116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/flink-shaded-hadoop2-uber-1.3.2.jar
116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/log4j-over-slf4j-1.7.25.jar
116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/logback-classic-1.2.3.jar
116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/logback-core-1.2.3.jar
116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/slf4j-api-1.7.25.jar
116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/xxxxx-flink-job-monitor-1.21-20171130.111758-2.jar
116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/xxxxx-flink-job-monitor-1.27-20171205.110224-2.jar
116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/xxxxx-flink-job-monitor-1.28.jar
116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/xxxxx-flink-job-monitor-1.30.jar
116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/xxxxx-flink-job-monitor-1.32.jar
116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/xxxxx-flink-job-monitor-1.33.jar
116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/xxxxx-flink-job-monitor-1.35.jar
116 /opt/app/xxxxx/dev/pkgs/flink/var/log/flink-flinkuser-taskmanager-0-box2.out
116 /usr/java/jdk1.8.0_131/jre/lib/ext/sunec.jar
116 /usr/java/jdk1.8.0_131/jre/lib/ext/sunpkcs11.jar
116 /usr/java/jdk1.8.0_131/jre/lib/jce.jar
116 /usr/java/jdk1.8.0_131/jre/lib/jsse.jar
116 /usr/java/jdk1.8.0_131/jre/lib/resources.jar
116 /usr/java/jdk1.8.0_131/jre/lib/rt.jar
{noformat}
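The comment does not show how the per-file counts were produced. One way to get a similar list, assuming the attachment is plain lsof output whose last whitespace-separated field is the file path (paths containing spaces would be truncated); `path_counts` is an illustrative name:

```shell
# Print "COUNT PATH" for every absolute path found in the last field of
# lsof-style lines, most frequent first. Reads the named files, or stdin
# when no file is given.
path_counts() {
  awk '$NF ~ /^\// { print $NF }' "$@" |
    sort | uniq -c | sort -rn |
    awk '{ print $1, $2 }'   # drop the padding that uniq -c adds
}

# Possible usage with the attached file:
# path_counts box2-taskmgr-lsof
```

Entries such as pipes and `[eventpoll]` are skipped because their last field does not start with `/`.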


was (Author: pnowojski):
If you take a look into the attached results of lsof, 60% are regular files:

 
{noformat}
cat box2-taskmgr-lsof | cut -c40-55 | sort -n | uniq -c
406       CHR       
116       DIR       
8294       REG       
3596      FIFO       
348      IPv6       
116      unix 0xffff
1798   a_inode
{noformat}
and those files repeat them selves 116 times:

 
{noformat}
116 /opt/app/wily/agent/Agent.jar
116 /opt/app/wily/agent/core/ext/AppMap.jar
116 /opt/app/wily/agent/core/ext/BasicDirectiveLoader.jar
116 /opt/app/wily/agent/core/ext/BizDef.jar
116 /opt/app/wily/agent/core/ext/BizTrxHttp.jar
116 /opt/app/wily/agent/core/ext/ChangeDetector-Agent_Server.jar
116 /opt/app/wily/agent/core/ext/ChangeDetector-CommonAll.jar
116 /opt/app/wily/agent/core/ext/ChangeDetectorAgent.jar
116 /opt/app/wily/agent/core/ext/DynInstrBootstrap.jar
116 /opt/app/wily/agent/core/ext/DynInstrSupport15.jar
116 /opt/app/wily/agent/core/ext/GCMonitor.jar
116 /opt/app/wily/agent/core/ext/HPC-GcMonitorAgent.jar
116 /opt/app/wily/agent/core/ext/Inheritance.jar
116 /opt/app/wily/agent/core/ext/Java15DynamicInstrumentation.jar
116 /opt/app/wily/agent/core/ext/LeakHunter.jar
116 /opt/app/wily/agent/core/ext/ProbeBuilder.jar
116 /opt/app/wily/agent/core/ext/RegexNormalizerExtension.jar
116 /opt/app/wily/agent/core/ext/SQLAgent.jar
116 /opt/app/wily/agent/core/ext/ServletHeaderDecorator.jar
116 /opt/app/wily/agent/core/ext/ServletHelper.jar
116 /opt/app/wily/agent/core/ext/Supportability-Agent.jar
116 /opt/app/wily/agent/core/ext/ThreadDumpGen.jar
116 /opt/app/wily/agent/core/ext/TomcatMonitoring.jar
116 /opt/app/wily/agent/core/ext/WebAppSupport.jar
116 /opt/app/wily/agent/core/ext/introscopeAIXPSeries32Stats.jar
116 /opt/app/wily/agent/core/ext/introscopeAIXPSeries64Stats.jar
116 /opt/app/wily/agent/core/ext/introscopeHpuxItanium32Stats.jar
116 /opt/app/wily/agent/core/ext/introscopeHpuxItanium64Stats.jar
116 /opt/app/wily/agent/core/ext/introscopeHpuxParisc32Stats.jar
116 /opt/app/wily/agent/core/ext/introscopeHpuxParisc64Stats.jar
116 /opt/app/wily/agent/core/ext/introscopeLinuxIntelAmd32Stats.jar
116 /opt/app/wily/agent/core/ext/introscopeLinuxIntelAmd64Stats.jar
116 /opt/app/wily/agent/core/ext/introscopeSolarisAmd32Stats.jar
116 /opt/app/wily/agent/core/ext/introscopeSolarisAmd64Stats.jar
116 /opt/app/wily/agent/core/ext/introscopeSolarisSparc32Stats.jar
116 /opt/app/wily/agent/core/ext/introscopeSolarisSparc64Stats.jar
116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/flink-dist_2.11-1.3.2.jar
116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/flink-python_2.11-1.3.2.jar
116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/flink-shaded-hadoop2-uber-1.3.2.jar
116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/log4j-over-slf4j-1.7.25.jar
116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/logback-classic-1.2.3.jar
116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/logback-core-1.2.3.jar
116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/slf4j-api-1.7.25.jar
116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/xxxxx-flink-job-monitor-1.21-20171130.111758-2.jar
116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/xxxxx-flink-job-monitor-1.27-20171205.110224-2.jar
116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/xxxxx-flink-job-monitor-1.28.jar
116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/xxxxx-flink-job-monitor-1.30.jar
116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/xxxxx-flink-job-monitor-1.32.jar
116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/xxxxx-flink-job-monitor-1.33.jar
116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/xxxxx-flink-job-monitor-1.35.jar
116 /opt/app/xxxxx/dev/pkgs/flink/var/log/flink-flinkuser-taskmanager-0-box2.out
116 /usr/java/jdk1.8.0_131/jre/lib/ext/sunec.jar
116 /usr/java/jdk1.8.0_131/jre/lib/ext/sunpkcs11.jar
116 /usr/java/jdk1.8.0_131/jre/lib/jce.jar
116 /usr/java/jdk1.8.0_131/jre/lib/jsse.jar
116 /usr/java/jdk1.8.0_131/jre/lib/resources.jar
116 /usr/java/jdk1.8.0_131/jre/lib/rt.jar
{noformat}

> Excessive amount of files opened by flink task manager
> ------------------------------------------------------
>
>                 Key: FLINK-8707
>                 URL: https://issues.apache.org/jira/browse/FLINK-8707
>             Project: Flink
>          Issue Type: Bug
>          Components: TaskManager
>    Affects Versions: 1.3.2
>         Environment: NAME="Red Hat Enterprise Linux Server"
> VERSION="7.3 (Maipo)"
> Two boxes, each with a Job Manager & Task Manager, using Zookeeper for HA.
> flink.yaml below with some settings (removed exact box names) etc:
> env.log.dir: ...some dir...residing on the same box
> env.pid.dir: some dir...residing on the same box
> metrics.reporter.jmx.class: org.apache.flink.metrics.jmx.JMXReporter
> metrics.reporters: jmx
> state.backend: filesystem
> state.backend.fs.checkpointdir: file:///some_nfs_mount
> state.checkpoints.dir: file:///some_nfs_mount
> state.checkpoints.num-retained: 3
> high-availability.cluster-id: /tst
> high-availability.storageDir: file:///some_nfs_mount/ha
> high-availability: zookeeper
> high-availability.zookeeper.path.root: /flink
> high-availability.zookeeper.quorum: ...list of zookeeper boxes
> env.java.opts.jobmanager: ...some extra jar args
> jobmanager.archive.fs.dir: some dir...residing on the same box
> jobmanager.web.submit.enable: true
> jobmanager.web.tmpdir:  some dir...residing on the same box
> env.java.opts.taskmanager: some extra jar args
> taskmanager.tmp.dirs:  some dir...residing on the same box/var/tmp
> taskmanager.network.memory.min: 1024MB
> taskmanager.network.memory.max: 2048MB
> blob.storage.directory:  some dir...residing on the same box
>            Reporter: Alexander Gardner
>            Priority: Blocker
>             Fix For: 1.5.0
>
>         Attachments: box1-jobmgr-lsof, box1-taskmgr-lsof, box2-jobmgr-lsof, box2-taskmgr-lsof
>
>
> The job manager has fewer FDs than the task manager.
>  
> Hi
> A support alert indicated that there were a lot of open files for the boxes running Flink.
> There were 4 flink jobs that were dormant but had consumed a number of msgs from Kafka using the FlinkKafkaConsumer010.
> A simple general lsof:
> $ lsof | wc -l       ->  returned 153114 open file descriptors.
> Focusing on the TaskManager process (process ID = 12154):
> $ lsof | grep 12154 | wc -l   -> returned 129322 open FDs
> $ lsof -p 12154 | wc -l   -> returned 531 FDs
> There were 228 threads running for the task manager.
>  
> Drilling down a bit further, looking at a_inode and FIFO entries: 
> $ lsof -p 12154 | grep a_inode | wc -l = 100 FDs
> $ lsof -p 12154 | grep FIFO | wc -l  = 200 FDs
> $ /proc/12154/maps = 920 entries.
> Apart from lsof identifying lots of JARs and SOs being referenced, there were also 244 child processes for the task manager process.
> Noticed a creep of file descriptors in each environment. Are the above figures deemed excessive for the number of FDs in use? I know Flink uses Netty - is it using a separate Selector for reads & writes?
> Additionally, does Flink use memory-mapped files or direct ByteBuffers, and are these skewing the number of FDs shown?
> Example of one child process ID 6633:
> java 12154 6633 dfdev 387u a_inode 0,9 0 5869 [eventpoll]
>  java 12154 6633 dfdev 388r FIFO 0,8 0t0 459758080 pipe
>  java 12154 6633 dfdev 389w FIFO 0,8 0t0 459758080 pipe
> Lastly, I cannot yet identify the reason for the creep in FDs, even when Flink is pretty dormant or has dormant jobs. Production nodes are not experiencing excessive amounts of throughput yet either.
>  
>  
>  
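
On the gap between `lsof | grep 12154` (129322) and `lsof -p 12154` (531) quoted above: one possible explanation, offered as an assumption rather than something confirmed in this message, is that a system-wide lsof lists open files once per thread (the second numeric column in the sample lines, e.g. 6633, would then be a thread ID), so the same descriptors are counted many times. The reported figures fit that reading: 531 lines is 530 files plus a header line, and 244 x 530 = 129320. A minimal sketch for cross-checking on Linux, where /proc/<pid>/fd holds one entry per open descriptor and /proc/<pid>/task one entry per thread; the function names are illustrative:

```shell
# Count the file descriptors a process actually holds (one entry per FD).
fd_count() {
  ls "/proc/$1/fd" | wc -l
}

# Count the threads of a process (one directory per thread).
thread_count() {
  ls "/proc/$1/task" | wc -l
}

# Possible usage for the TaskManager PID from the report:
# fd_count 12154
# thread_count 12154
```

If `fd_count` stays flat over time while the system-wide lsof total grows, the growth is more likely in the number of threads than in descriptors.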



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
