Date: Mon, 11 Sep 2017 01:50:02 +0000 (UTC)
From: "Apache Spark (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Assigned] (SPARK-21971) Too many open files in Spark due to concurrent files being opened

    [ https://issues.apache.org/jira/browse/SPARK-21971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-21971:
------------------------------------

    Assignee:     (was: Apache Spark)

> Too many open files in Spark due to concurrent files being opened
> -----------------------------------------------------------------
>
>                 Key: SPARK-21971
>                 URL: https://issues.apache.org/jira/browse/SPARK-21971
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.1.0
>            Reporter: Rajesh Balamohan
>            Priority: Minor
>
> When running Q67 of TPC-DS on a 1 TB dataset on a multi-node cluster, it consistently
> fails with a "too many open files" exception.
> {noformat}
> O scheduler.TaskSetManager: Finished task 25.0 in stage 844.0 (TID 243786) in 394 ms on machine111.xyz (executor 2) (189/200)
> 17/08/20 10:33:45 INFO scheduler.TaskSetManager: Finished task 172.0 in stage 844.0 (TID 243932) in 11996 ms on cn116-10.l42scl.hortonworks.com (executor 6) (190/200)
> 17/08/20 10:37:40 WARN scheduler.TaskSetManager: Lost task 144.0 in stage 844.0 (TID 243904, machine1.xyz, executor 1): java.nio.file.FileSystemException: /grid/3/hadoop/yarn/local/usercache/rbalamohan/appcache/application_1490656001509_7207/blockmgr-5180e3f0-f7ed-44bb-affc-8f99f09ba7bc/28/temp_local_690afbf7-172d-4fdb-8492-3e2ebd8d5183: Too many open files
>     at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
>     at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>     at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
>     at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177)
>     at java.nio.channels.FileChannel.open(FileChannel.java:287)
>     at java.nio.channels.FileChannel.open(FileChannel.java:335)
>     at org.apache.spark.io.NioBufferedFileInputStream.<init>(NioBufferedFileInputStream.java:43)
>     at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.<init>(UnsafeSorterSpillReader.java:75)
>     at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.getReader(UnsafeSorterSpillWriter.java:150)
>     at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getIterator(UnsafeExternalSorter.java:607)
>     at org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray.generateIterator(ExternalAppendOnlyUnsafeRowArray.scala:169)
>     at org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray.generateIterator(ExternalAppendOnlyUnsafeRowArray.scala:173)
> {noformat}
>
> The cluster was configured with multiple cores per executor.
>
> The window function uses "spark.sql.windowExec.buffer.spill.threshold=4096", which causes a large number of spills on larger datasets. With multiple cores per executor, this reproduces easily.
>
> {{UnsafeExternalSorter::getIterator()}} invokes {{spillWriter.getReader}} for all of the available spillWriters. {{UnsafeSorterSpillReader}} opens the file in its constructor and only closes it later as part of its close() call. This leads to the "too many open files" issue (a simplified sketch of this pattern is shown below).
>
> Note that this is not a file leak; rather, it is a question of how many files are open concurrently at any given time, which depends on the dataset being processed.
>
> One option could be to increase "spark.sql.windowExec.buffer.spill.threshold" so that fewer spill files are generated, but it is hard to determine a sweet spot for all workloads (an example of adjusting it is sketched below). Another option is to raise the file ulimit to "unlimited", but that would not be a good production setting. It would be better to reduce the number of files held open concurrently by {{UnsafeExternalSorter::getIterator()}}.
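>
> Below is a minimal, self-contained Java sketch of the pattern described above (this is not Spark source; class and variable names such as {{SpillReader}} are illustrative only): a reader is constructed for every spill file up front, each constructor opens its file immediately, and nothing is released until close(), so the number of open descriptors grows with spills per task times concurrent tasks per executor.
> {noformat}
> import java.io.Closeable;
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
>
> public class EagerSpillReaderSketch {
>
>   // Stands in for UnsafeSorterSpillReader: the file is opened in the
>   // constructor and the descriptor is only released in close().
>   static class SpillReader implements Closeable {
>     private final FileInputStream in;
>
>     SpillReader(File spillFile) throws IOException {
>       this.in = new FileInputStream(spillFile); // descriptor held from here on
>     }
>
>     @Override
>     public void close() throws IOException {
>       in.close();
>     }
>   }
>
>   public static void main(String[] args) throws IOException {
>     // Pretend these are the spill files a single task produced while sorting.
>     List<File> spillFiles = new ArrayList<>();
>     for (int i = 0; i < 200; i++) {
>       File f = File.createTempFile("spill-", ".bin");
>       f.deleteOnExit();
>       spillFiles.add(f);
>     }
>
>     // Mirrors the getIterator() pattern from this ticket: one reader per spill
>     // file, all constructed (and therefore all opened) before any is consumed.
>     // With many spills per task and several tasks per executor, the total
>     // number of descriptors held at once can exceed the process ulimit.
>     List<SpillReader> readers = new ArrayList<>();
>     for (File f : spillFiles) {
>       readers.add(new SpillReader(f));
>     }
>     System.out.println("Spill files held open concurrently: " + readers.size());
>
>     for (SpillReader r : readers) {
>       r.close();
>     }
>   }
> }
> {noformat}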
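>
> As an illustration of the first workaround (raising the spill threshold so the window operator buffers more rows before spilling and therefore produces fewer spill files), the setting can be supplied when building the session. The value 65536 below is purely an example, not a recommendation; as noted above, a single good value is hard to pick for all workloads.
> {noformat}
> import org.apache.spark.sql.SparkSession;
>
> public class WindowSpillThresholdExample {
>   public static void main(String[] args) {
>     SparkSession spark = SparkSession.builder()
>         .appName("window-spill-threshold-example")
>         .master("local[*]") // local master only for trying the setting; drop when submitting to a cluster
>         // Larger values mean fewer, bigger spill files per task at the cost
>         // of more executor memory held by the window operator's buffer.
>         .config("spark.sql.windowExec.buffer.spill.threshold", "65536")
>         .getOrCreate();
>
>     // ... run the window-heavy query (e.g. TPC-DS Q67) here ...
>
>     spark.stop();
>   }
> }
> {noformat}
> The same setting can also be passed on the command line, e.g. with --conf spark.sql.windowExec.buffer.spill.threshold=<n> on spark-submit.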