Date: Mon, 11 Sep 2017 01:50:02 +0000 (UTC)
From: "Apache Spark (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Assigned] (SPARK-21971) Too many open files in Spark due to concurrent files being opened

    [ https://issues.apache.org/jira/browse/SPARK-21971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-21971:
------------------------------------

    Assignee:     (was: Apache Spark)

> Too many open files in Spark due to concurrent files being opened
> -----------------------------------------------------------------
>
>                 Key: SPARK-21971
>                 URL: https://issues.apache.org/jira/browse/SPARK-21971
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.1.0
>            Reporter: Rajesh Balamohan
>            Priority: Minor
>
> When running Q67 of TPC-DS on a 1 TB dataset on a multi-node cluster, it consistently
> fails with a "too many open files" exception.
> {noformat}
> O scheduler.TaskSetManager: Finished task 25.0 in stage 844.0 (TID 243786) in 394 ms on machine111.xyz (executor 2) (189/200)
> 17/08/20 10:33:45 INFO scheduler.TaskSetManager: Finished task 172.0 in stage 844.0 (TID 243932) in 11996 ms on cn116-10.l42scl.hortonworks.com (executor 6) (190/200)
> 17/08/20 10:37:40 WARN scheduler.TaskSetManager: Lost task 144.0 in stage 844.0 (TID 243904, machine1.xyz, executor 1): java.nio.file.FileSystemException: /grid/3/hadoop/yarn/local/usercache/rbalamohan/appcache/application_1490656001509_7207/blockmgr-5180e3f0-f7ed-44bb-affc-8f99f09ba7bc/28/temp_local_690afbf7-172d-4fdb-8492-3e2ebd8d5183: Too many open files
>     at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
>     at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>     at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
>     at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177)
>     at java.nio.channels.FileChannel.open(FileChannel.java:287)
>     at java.nio.channels.FileChannel.open(FileChannel.java:335)
>     at org.apache.spark.io.NioBufferedFileInputStream.<init>(NioBufferedFileInputStream.java:43)
>     at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.<init>(UnsafeSorterSpillReader.java:75)
>     at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.getReader(UnsafeSorterSpillWriter.java:150)
>     at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getIterator(UnsafeExternalSorter.java:607)
>     at org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray.generateIterator(ExternalAppendOnlyUnsafeRowArray.scala:169)
>     at org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray.generateIterator(ExternalAppendOnlyUnsafeRowArray.scala:173)
> {noformat}
>
> The cluster was configured with multiple cores per executor.
>
> The window function uses "spark.sql.windowExec.buffer.spill.threshold=4096", which causes a large number of spills on larger datasets. With multiple cores per executor, this reproduces easily.
>
> {{UnsafeExternalSorter::getIterator()}} invokes {{spillWriter.getReader}} for all of the available spillWriters. {{UnsafeSorterSpillReader}} opens the file in its constructor and only closes it later as part of its close() call. This leads to the "too many open files" issue (a simplified sketch of this pattern is shown below).
>
> Note that this is not a file leak; rather, it is a question of how many files are open concurrently at any given time, which depends on the dataset being processed.
>
> One option could be to increase "spark.sql.windowExec.buffer.spill.threshold" so that fewer spill files are generated, but it is hard to determine a sweet spot for all workloads (an example of adjusting it is sketched below). Another option is to raise the file ulimit to "unlimited", but that would not be a good production setting. It would be better to reduce the number of files held open concurrently by {{UnsafeExternalSorter::getIterator()}}.
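>
> Below is a minimal, self-contained Java sketch of the pattern described above (this is not Spark source; class and variable names such as {{SpillReader}} are illustrative only): a reader is constructed for every spill file up front, each constructor opens its file immediately, and nothing is released until close(), so the number of open descriptors grows with spills per task times concurrent tasks per executor.
> {noformat}
> import java.io.Closeable;
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
>
> public class EagerSpillReaderSketch {
>
>   // Stands in for UnsafeSorterSpillReader: the file is opened in the
>   // constructor and the descriptor is only released in close().
>   static class SpillReader implements Closeable {
>     private final FileInputStream in;
>
>     SpillReader(File spillFile) throws IOException {
>       this.in = new FileInputStream(spillFile); // descriptor held from here on
>     }
>
>     @Override
>     public void close() throws IOException {
>       in.close();
>     }
>   }
>
>   public static void main(String[] args) throws IOException {
>     // Pretend these are the spill files a single task produced while sorting.
>     List<File> spillFiles = new ArrayList<>();
>     for (int i = 0; i < 200; i++) {
>       File f = File.createTempFile("spill-", ".bin");
>       f.deleteOnExit();
>       spillFiles.add(f);
>     }
>
>     // Mirrors the getIterator() pattern from this ticket: one reader per spill
>     // file, all constructed (and therefore all opened) before any is consumed.
>     // With many spills per task and several tasks per executor, the total
>     // number of descriptors held at once can exceed the process ulimit.
>     List<SpillReader> readers = new ArrayList<>();
>     for (File f : spillFiles) {
>       readers.add(new SpillReader(f));
>     }
>     System.out.println("Spill files held open concurrently: " + readers.size());
>
>     for (SpillReader r : readers) {
>       r.close();
>     }
>   }
> }
> {noformat}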
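>
> As an illustration of the first workaround (raising the spill threshold so the window operator buffers more rows before spilling and therefore produces fewer spill files), the setting can be supplied when building the session. The value 65536 below is purely an example, not a recommendation; as noted above, a single good value is hard to pick for all workloads.
> {noformat}
> import org.apache.spark.sql.SparkSession;
>
> public class WindowSpillThresholdExample {
>   public static void main(String[] args) {
>     SparkSession spark = SparkSession.builder()
>         .appName("window-spill-threshold-example")
>         .master("local[*]") // local master only for trying the setting; drop when submitting to a cluster
>         // Larger values mean fewer, bigger spill files per task at the cost
>         // of more executor memory held by the window operator's buffer.
>         .config("spark.sql.windowExec.buffer.spill.threshold", "65536")
>         .getOrCreate();
>
>     // ... run the window-heavy query (e.g. TPC-DS Q67) here ...
>
>     spark.stop();
>   }
> }
> {noformat}
> The same setting can also be passed on the command line, e.g. with --conf spark.sql.windowExec.buffer.spill.threshold=<n> on spark-submit.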