Date: Mon, 29 Aug 2016 15:46:20 +0000 (UTC)
From: "Maximilian Michels (JIRA)"
To: issues@flink.apache.org
Reply-To: dev@flink.apache.org
Subject: [jira] [Commented] (FLINK-4485) Finished jobs in yarn session fill /tmp filesystem
archived-at: Mon, 29 Aug 2016 15:46:23 -0000

[ https://issues.apache.org/jira/browse/FLINK-4485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446231#comment-15446231 ]

Maximilian Michels commented on FLINK-4485:
-------------------------------------------

Thanks for the update. That's very important information.
Have you run a similar number of jobs with 1.0.x and experienced this problem? In theory, the classloader of old jobs should be discarded when the ExecutionGraph is removed from the archive (after 5 new jobs have been submitted). Perhaps we hold another reference to the classloader in the Web UI which prevents it from getting garbage collected.

> Finished jobs in yarn session fill /tmp filesystem
> --------------------------------------------------
>
> Key: FLINK-4485
> URL: https://issues.apache.org/jira/browse/FLINK-4485
> Project: Flink
> Issue Type: Bug
> Components: JobManager
> Affects Versions: 1.1.0
> Reporter: Niels Basjes
> Priority: Blocker
>
> On a Yarn cluster I start a yarn-session with a few containers and task slots.
> Then I fire a 'large' number of Flink batch jobs in sequence against this yarn session. It is the exact same job (java code) yet it gets different parameters.
> In this scenario it is exporting HBase tables to files in HDFS and the parameters are about which data from which tables and the name of the target directory.
> After running several dozen jobs, job submission started to fail and we investigated.
> We found that the cause was that on the Yarn node which was hosting the jobmanager the /tmp file system was full (4GB was 100% full).
> However, the output of {{du -hcs /tmp}} showed only 200MB in use.
> We found that a very large file (we guess it is the jar of the job) was put in /tmp, used, and deleted, yet the file handle was not closed by the jobmanager.
> As soon as we killed the jobmanager the disk space was freed.
> The summary of the impact of this is that a yarn-session that receives enough jobs brings down the Yarn node for all users.
> See parts of the output we got from {{lsof}} below.
> {code}
> COMMAND PID   USER    FD   TYPE DEVICE SIZE     NODE NAME
> java    15034 nbasjes 550r REG  253,17 66219695 245  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000003 (deleted)
> java    15034 nbasjes 551r REG  253,17 66219695 252  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000007 (deleted)
> java    15034 nbasjes 552r REG  253,17 66219695 267  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000012 (deleted)
> java    15034 nbasjes 553r REG  253,17 66219695 250  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000005 (deleted)
> java    15034 nbasjes 554r REG  253,17 66219695 288  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000018 (deleted)
> java    15034 nbasjes 555r REG  253,17 66219695 298  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000025 (deleted)
> java    15034 nbasjes 557r REG  253,17 66219695 254  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000008 (deleted)
> java    15034 nbasjes 558r REG  253,17 66219695 292  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000019 (deleted)
> java    15034 nbasjes 559r REG  253,17 66219695 275  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000013 (deleted)
> java    15034 nbasjes 560r REG  253,17 66219695 159  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000002 (deleted)
> java    15034 nbasjes 562r REG  253,17 66219695 238  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000001 (deleted)
> java    15034 nbasjes 568r REG  253,17 66219695 246  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000004 (deleted)
> java    15034 nbasjes 569r REG  253,17 66219695 255  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000009 (deleted)
> java    15034 nbasjes 571r REG  253,17 66219695 299  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000026 (deleted)
> java    15034 nbasjes 572r REG  253,17 66219695 293  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000020 (deleted)
> java    15034 nbasjes 574r REG  253,17 66219695 256  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000010 (deleted)
> java    15034 nbasjes 575r REG  253,17 66219695 302  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000029 (deleted)
> java    15034 nbasjes 576r REG  253,17 66219695 294  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000021 (deleted)
> java    15034 nbasjes 577r REG  253,17 66219695 262  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000011 (deleted)
> java    15034 nbasjes 578r REG  253,17 66219695 251  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000006 (deleted)
> java    15034 nbasjes 580r REG  253,17 66219695 295  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000022 (deleted)
> java    15034 nbasjes 581r REG  253,17 66219695 300  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000027 (deleted)
> java    15034 nbasjes 582r REG  253,17 66219695 188  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/cache/blob_e318d1698aa6e7dc91e5f4a9f8ba29781aebd8c4 (deleted)
> java    15034 nbasjes 585r REG  253,17 66219695 279  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000014 (deleted)
> java    15034 nbasjes 586r REG  253,17 66219695 296  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000023 (deleted)
> java    15034 nbasjes 588r REG  253,17 66219695 301  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000028 (deleted)
> java    15034 nbasjes 589r REG  253,17 66219695 297  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000024 (deleted)
> java    15034 nbasjes 598r REG  253,17 66219695 280  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000015 (deleted)
> java    15034 nbasjes 601r REG  253,17 66219695 289  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000016 (deleted)
> java    15034 nbasjes 604r REG  253,17 66219695 284  /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000017 (deleted)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
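
The gap between the full /tmp file system and the 200MB reported by {{du}} in the quoted issue is consistent with space held by files that were unlinked while a process still had them open. As a rough way to quantify that from an {{lsof}} listing like the one quoted above, here is a minimal sketch. It assumes the nine-column layout shown in the issue (COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME) followed by a "(deleted)" marker, and paths without whitespace. The names `deleted_open_bytes` and `SAMPLE` are hypothetical and not part of Flink or lsof.

```python
from typing import Dict


def deleted_open_bytes(lsof_output: str) -> int:
    """Sum SIZE over rows marked '(deleted)', counting each inode (NODE) once.

    Assumes whitespace-separated columns in the order
    COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME, with '(deleted)' as the
    last token. Paths containing spaces would break this simple split.
    """
    sizes_by_node: Dict[str, int] = {}
    for line in lsof_output.splitlines():
        fields = line.split()
        # Skip the header row and any file that still has a directory entry.
        if len(fields) < 10 or fields[-1] != "(deleted)":
            continue
        size, node = fields[6], fields[7]
        if size.isdigit():
            # Several descriptors can point at one inode; count it once.
            sizes_by_node[node] = int(size)
    return sum(sizes_by_node.values())


# Two rows copied from the listing in the issue.
SAMPLE = """\
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
java 15034 nbasjes 550r REG 253,17 66219695 245 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000003 (deleted)
java 15034 nbasjes 551r REG 253,17 66219695 252 /tmp/blobStore-fbe9c4cf-1f85-48cb-aad9-180e8d4ec7ce/incoming/temp-00000007 (deleted)
"""

print(deleted_open_bytes(SAMPLE))  # 132439390
```

Keying on the NODE column avoids double counting when one process holds several descriptors on the same file. The sketch only totals what a listing reports; it does not identify which component is holding the handles.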