flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-4485) Finished jobs in yarn session fill /tmp filesystem
Date Thu, 15 Sep 2016 11:04:20 GMT

    [ https://issues.apache.org/jira/browse/FLINK-4485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15493019#comment-15493019 ]

ASF GitHub Bot commented on FLINK-4485:
---------------------------------------

GitHub user mxm opened a pull request:

    https://github.com/apache/flink/pull/2499

    [FLINK-4485] close and remove user class loader after job completion

    Keeping the user class loader around after job completion may lead to
    excessive temp space usage because all user jars are kept until the
    class loader is garbage collected. Tests showed that garbage collection
    can be delayed for a long time after the class loader is no longer
    referenced. Note that the class loader only becomes unreferenced once
    its job has been removed from the archive.
    
    The fastest way to minimize temp space usage is to close and remove the
    URLClassLoader after job completion. This requires us to keep a
    serializable copy of all data that needs the user class loader after
    job completion, e.g. to display data on the web interface.
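
The idea of closing the class loader explicitly, rather than waiting for garbage collection, can be sketched as follows. This is a minimal, self-contained illustration and not the code from the pull request: the class name UserClassLoaderCleanup, the helper writeSampleJar, and the resource name job.properties are all hypothetical. It relies only on java.net.URLClassLoader#close() (available since Java 7), which releases the jar files the loader has opened so that they can be deleted right away.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

public class UserClassLoaderCleanup {

    /** Builds a tiny jar with one resource; a stand-in for an uploaded user jar (hypothetical). */
    public static Path writeSampleJar() throws IOException {
        Path jar = Files.createTempFile("user-job-", ".jar");
        try (OutputStream out = Files.newOutputStream(jar);
             JarOutputStream jos = new JarOutputStream(out)) {
            jos.putNextEntry(new JarEntry("job.properties"));
            jos.write("name=example".getBytes(StandardCharsets.UTF_8));
            jos.closeEntry();
        }
        return jar;
    }

    /**
     * Reads a resource through a URLClassLoader, then closes the loader and
     * deletes the jar. Returns true if the jar no longer exists afterwards.
     */
    public static boolean runAndCleanUp() {
        try {
            Path jar = writeSampleJar();
            URLClassLoader loader =
                    new URLClassLoader(new URL[] {jar.toUri().toURL()}, null);
            try (InputStream in = loader.getResourceAsStream("job.properties")) {
                if (in == null) {
                    throw new IllegalStateException("resource not found in jar");
                }
                in.read(); // reading the resource makes the loader open the jar
            } finally {
                // Explicit close releases the jar instead of waiting for GC.
                loader.close();
            }
            Files.delete(jar);
            return !Files.exists(jar);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("jar removed: " + runAndCleanUp());
    }
}
```

Any class or resource that is still needed after this point has to be copied out before close() is called, which is why the description above mentions keeping a serializable copy of such data.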

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mxm/flink FLINK-4485

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2499.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2499
    
----
commit 6ed17b9f5b9c13c80200ccf3db82bbfe727830bb
Author: Maximilian Michels <mxm@apache.org>
Date:   2016-09-15T09:00:58Z

    [FLINK-4485] close and remove user class loader after job completion
    

----


> Finished jobs in yarn session fill /tmp filesystem
> --------------------------------------------------
>
>                 Key: FLINK-4485
>                 URL: https://issues.apache.org/jira/browse/FLINK-4485
>             Project: Flink
>          Issue Type: Bug
>          Components: JobManager
>    Affects Versions: 1.1.0
>            Reporter: Niels Basjes
>            Assignee: Maximilian Michels
>            Priority: Blocker
>
> On a Yarn cluster I start a yarn-session with a few containers and task slots.
> Then I fire a 'large' number of Flink batch jobs in sequence against this yarn session. It is the exact same job (Java code), yet it gets different parameters.
> In this scenario it is exporting HBase tables to files in HDFS, and the parameters specify which data from which tables to export and the name of the target directory.
> After running several dozen jobs, job submission started to fail and we investigated.
> We found that the cause was that on the Yarn node hosting the jobmanager the /tmp file system was full (4GB, 100% used).
> However, the output of {{du -hcs /tmp}} showed only 200MB in use.
> We found that a very large file (we guess it is the jar of the job) was put in /tmp, used, and deleted, yet the file handle was not closed by the jobmanager.
> As soon as we killed the jobmanager the disk space was freed.
> In summary, a yarn-session that receives enough jobs brings down the Yarn node for all users.
> See parts of the output we got from {{lsof}} below.
> {code}
> COMMAND     PID      USER   FD      TYPE             DEVICE      SIZE       NODE NAME
> java      15034   nbasjes  550r      REG             253,17  66219695        245 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000003 (deleted)
> java      15034   nbasjes  551r      REG             253,17  66219695        252 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000007 (deleted)
> java      15034   nbasjes  552r      REG             253,17  66219695        267 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000012 (deleted)
> java      15034   nbasjes  553r      REG             253,17  66219695        250 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000005 (deleted)
> java      15034   nbasjes  554r      REG             253,17  66219695        288 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000018 (deleted)
> java      15034   nbasjes  555r      REG             253,17  66219695        298 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000025 (deleted)
> java      15034   nbasjes  557r      REG             253,17  66219695        254 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000008 (deleted)
> java      15034   nbasjes  558r      REG             253,17  66219695        292 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000019 (deleted)
> java      15034   nbasjes  559r      REG             253,17  66219695        275 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000013 (deleted)
> java      15034   nbasjes  560r      REG             253,17  66219695        159 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000002 (deleted)
> java      15034   nbasjes  562r      REG             253,17  66219695        238 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000001 (deleted)
> java      15034   nbasjes  568r      REG             253,17  66219695        246 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000004 (deleted)
> java      15034   nbasjes  569r      REG             253,17  66219695        255 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000009 (deleted)
> java      15034   nbasjes  571r      REG             253,17  66219695        299 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000026 (deleted)
> java      15034   nbasjes  572r      REG             253,17  66219695        293 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000020 (deleted)
> java      15034   nbasjes  574r      REG             253,17  66219695        256 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000010 (deleted)
> java      15034   nbasjes  575r      REG             253,17  66219695        302 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000029 (deleted)
> java      15034   nbasjes  576r      REG             253,17  66219695        294 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000021 (deleted)
> java      15034   nbasjes  577r      REG             253,17  66219695        262 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000011 (deleted)
> java      15034   nbasjes  578r      REG             253,17  66219695        251 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000006 (deleted)
> java      15034   nbasjes  580r      REG             253,17  66219695        295 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000022 (deleted)
> java      15034   nbasjes  581r      REG             253,17  66219695        300 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000027 (deleted)
> java      15034   nbasjes  582r      REG             253,17  66219695        188 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/cache/blob_e318d1698aa6e7dc91e5f4a9f8ba29781aebd8c4 (deleted)
> java      15034   nbasjes  585r      REG             253,17  66219695        279 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000014 (deleted)
> java      15034   nbasjes  586r      REG             253,17  66219695        296 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000023 (deleted)
> java      15034   nbasjes  588r      REG             253,17  66219695        301 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000028 (deleted)
> java      15034   nbasjes  589r      REG             253,17  66219695        297 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000024 (deleted)
> java      15034   nbasjes  598r      REG             253,17  66219695        280 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000015 (deleted)
> java      15034   nbasjes  601r      REG             253,17  66219695        289 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000016 (deleted)
> java      15034   nbasjes  604r      REG             253,17  66219695        284 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000017 (deleted)
> {code}
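
The gap between the full file system and the 200MB reported by du matches how POSIX file systems treat a file that is deleted while still open: the directory entry disappears, so du no longer sees it, but the blocks stay allocated until the last open handle is closed. The sketch below reproduces that behaviour in isolation. It assumes a POSIX-like system such as Linux (on Windows the delete would typically fail instead), and the class name DeletedButOpenFile and the method readAfterDelete are hypothetical names chosen for this illustration.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class DeletedButOpenFile {

    /**
     * Writes a file of the given size, opens it, deletes its directory entry,
     * and then reads it through the still-open stream.
     * Returns the number of bytes that could be read after the deletion.
     */
    public static int readAfterDelete(int size) {
        try {
            Path file = Files.createTempFile("blob-", ".tmp");
            Files.write(file, new byte[size]);
            try (FileInputStream in = new FileInputStream(file.toFile())) {
                // The name is removed, so tools that walk directories (du) skip it.
                Files.delete(file);
                // The open descriptor still pins the data on disk.
                byte[] buffer = new byte[1024];
                int total = 0;
                int n;
                while ((n = in.read(buffer)) != -1) {
                    total += n;
                }
                return total;
            } // closing the stream is what finally lets the space be reclaimed
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("bytes readable after delete: " + readAfterDelete(4096));
    }
}
```

This is consistent with the report that the space was freed as soon as the jobmanager process was killed, since process exit closes all of its open descriptors.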



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
