Date: Tue, 30 Jun 2015 01:34:05 +0000 (UTC)
From: "Andrew Or (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Reopened] (SPARK-8437) Using directory path without wildcard for filename slow for large number of files with wholeTextFiles and binaryFiles

     [ https://issues.apache.org/jira/browse/SPARK-8437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or reopened SPARK-8437:
------------------------------

> Using directory path without wildcard for filename slow for large number of files with wholeTextFiles and binaryFiles
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-8437
>                 URL: https://issues.apache.org/jira/browse/SPARK-8437
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 1.3.1, 1.4.0
>         Environment: Ubuntu 15.04 + local filesystem
>                      Amazon EMR + S3 + HDFS
>            Reporter: Ewan Leith
>            Assignee: Sean Owen
>            Priority: Minor
>
> When calling wholeTextFiles or binaryFiles on a directory path containing tens of thousands of files, Spark hangs for a few minutes before processing the files.
> If you add a * to the end of the path, there is no delay.
> This happens for me on Spark 1.3.1 and 1.4 on the local filesystem, HDFS, and on S3.
> To reproduce, create a directory with 50,000 files in it, then run:
>
>   val a = sc.binaryFiles("file:/path/to/files/")
>   a.count()
>
>   val b = sc.binaryFiles("file:/path/to/files/*")
>   b.count()
>
> and compare the startup times.
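> One quick way to populate such a test directory, as a hedged sketch: the java.nio calls below are standard JDK, but the file names and payload sizes are illustrative, and /home/ewan/large is simply the path that appears in the logs that follow. It can be pasted into the spark-shell before running the repro:
>
>   import java.nio.file.{Files, Paths}
>   import java.nio.charset.StandardCharsets
>
>   // Illustrative test data: 50,000 tiny files in a single directory.
>   val dir = Paths.get("/home/ewan/large")
>   Files.createDirectories(dir)
>   (1 to 50000).foreach { i =>
>     Files.write(dir.resolve(s"part-$i.bin"),
>       s"payload $i".getBytes(StandardCharsets.UTF_8))
>   }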
> For example, in the spark-shell these commands were pasted in together, so the delay at f.count() runs from 10:11:08 to 10:13:29 before "Total input paths to process : 49999" is output, and then until 10:15:42 before file processing begins:
>
> scala> val f = sc.binaryFiles("file:/home/ewan/large/")
> 15/06/18 10:11:07 INFO MemoryStore: ensureFreeSpace(160616) called with curMem=0, maxMem=278019440
> 15/06/18 10:11:07 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 156.9 KB, free 265.0 MB)
> 15/06/18 10:11:08 INFO MemoryStore: ensureFreeSpace(17282) called with curMem=160616, maxMem=278019440
> 15/06/18 10:11:08 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 16.9 KB, free 265.0 MB)
> 15/06/18 10:11:08 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:40430 (size: 16.9 KB, free: 265.1 MB)
> 15/06/18 10:11:08 INFO SparkContext: Created broadcast 0 from binaryFiles at <console>:21
> f: org.apache.spark.rdd.RDD[(String, org.apache.spark.input.PortableDataStream)] = file:/home/ewan/large/ BinaryFileRDD[0] at binaryFiles at <console>:21
>
> scala> f.count()
> 15/06/18 10:13:29 INFO FileInputFormat: Total input paths to process : 49999
> 15/06/18 10:15:42 INFO FileInputFormat: Total input paths to process : 49999
> 15/06/18 10:15:42 INFO CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 1, size left: 0
> 15/06/18 10:15:42 INFO SparkContext: Starting job: count at <console>:24
> 15/06/18 10:15:42 INFO DAGScheduler: Got job 0 (count at <console>:24) with 49999 output partitions (allowLocal=false)
> 15/06/18 10:15:42 INFO DAGScheduler: Final stage: ResultStage 0(count at <console>:24)
> 15/06/18 10:15:42 INFO DAGScheduler: Parents of final stage: List()
>
> Adding a * to the end of the path removes the delay:
>
> scala> val f = sc.binaryFiles("file:/home/ewan/large/*")
> 15/06/18 10:08:29 INFO MemoryStore: ensureFreeSpace(160616) called with curMem=0, maxMem=278019440
> 15/06/18 10:08:29 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 156.9 KB, free 265.0 MB)
> 15/06/18 10:08:29 INFO MemoryStore: ensureFreeSpace(17309) called with curMem=160616, maxMem=278019440
> 15/06/18 10:08:29 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 16.9 KB, free 265.0 MB)
> 15/06/18 10:08:29 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:42825 (size: 16.9 KB, free: 265.1 MB)
> 15/06/18 10:08:29 INFO SparkContext: Created broadcast 0 from binaryFiles at <console>:21
> f: org.apache.spark.rdd.RDD[(String, org.apache.spark.input.PortableDataStream)] = file:/home/ewan/large/* BinaryFileRDD[0] at binaryFiles at <console>:21
>
> scala> f.count()
> 15/06/18 10:08:32 INFO FileInputFormat: Total input paths to process : 49999
> 15/06/18 10:08:33 INFO FileInputFormat: Total input paths to process : 49999
> 15/06/18 10:08:35 INFO CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 1, size left: 0
> 15/06/18 10:08:35 INFO SparkContext: Starting job: count at <console>:24
> 15/06/18 10:08:35 INFO DAGScheduler: Got job 0 (count at <console>:24) with 49999 output partitions
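> Since appending * sidesteps the slow path, a caller-side workaround until the listing behavior itself is fixed is to normalize directory paths to glob form before calling binaryFiles. A minimal sketch; binaryFilesGlob is a hypothetical helper name, not part of the Spark API:
>
>   // Hypothetical helper: append "*" so the fast glob-expansion path is taken.
>   def binaryFilesGlob(sc: org.apache.spark.SparkContext, dir: String) = {
>     val glob = if (dir.endsWith("/")) dir + "*" else dir + "/*"
>     sc.binaryFiles(glob)
>   }
>
>   val f = binaryFilesGlob(sc, "file:/home/ewan/large")
>   f.count()   // starts in seconds rather than minutes in the logs above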