From: "Hyukjin Kwon (JIRA)"
To: issues@spark.apache.org
Date: Wed, 2 Nov 2016 14:22:58 +0000 (UTC)
Subject: [jira] [Commented] (SPARK-12677) Lazy file discovery for parquet

[ https://issues.apache.org/jira/browse/SPARK-12677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15629114#comment-15629114 ]

Hyukjin Kwon commented on SPARK-12677:
--------------------------------------

Ah, I see. In my opinion this might not be an issue as long as it fails with a clear message, for the same reason [~reactormonk] gave. I first thought you meant schema inference; in that case Spark does launch a separate job for schema inference [1], and I submitted a PR [2] to deal with that by reading the footers in the driver when there are only a few files.

[1] https://github.com/apache/spark/blob/77a98162d1ec28247053b8b3ad4af28baa950797/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L585-L593
[2] https://github.com/apache/spark/pull/14660
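A minimal sketch of the idea behind [2], assuming the parquet-mr and Hadoop APIs of that era; readFooters and the threshold below are illustrative names, not the actual Spark internals, and the distributed branch is elided:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileStatus
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.metadata.ParquetMetadata

// When only a few files are involved, read their footers directly on the
// driver instead of paying the overhead of scheduling a Spark job.
def readFooters(
    statuses: Seq[FileStatus],
    conf: Configuration,
    driverSideThreshold: Int = 32): Seq[ParquetMetadata] = {
  if (statuses.length <= driverSideThreshold) {
    // Few files: serial local reads are cheaper than a distributed job.
    statuses.map(s => ParquetFileReader.readFooter(conf, s.getPath))
  } else {
    // Many files: a real implementation would distribute the footer reads
    // across the cluster; ParquetMetadata is not serializable, so the
    // results would need a serializable representation on the way back.
    ???
  }
}
{code}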
> Lazy file discovery for parquet
> -------------------------------
>
>                  Key: SPARK-12677
>                  URL: https://issues.apache.org/jira/browse/SPARK-12677
>              Project: Spark
>           Issue Type: Wish
>           Components: SQL
>             Reporter: Tiago Albineli Motta
>             Priority: Minor
>               Labels: features
>
> When using sqlContext.read.parquet(files: _*), the driver verifies up front that the files are OK. But reading them is lazy, so by the time the read actually starts the files may no longer be there, or may have changed, and we receive this error message:
> {quote}
> 16/01/06 10:52:43 ERROR yarn.ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 0.0 failed 4 times, most recent failure: Lost task 4.3 in stage 0.0 (TID 16, riolb586.globoi.com): java.io.FileNotFoundException: File does not exist: hdfs://mynamenode.com:8020/rec/prefs/2016/01/06/part-r-00003-27a100b0-ff49-45ad-8803-e6cc77286661.gz.parquet
> 	at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
> 	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
> 	at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:381)
> 	at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:155)
> 	at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
> 	at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:153)
> 	at org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124)
> 	at org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 	at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
> 	at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
> 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:70)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> {quote}
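One user-side mitigation for this race (not something the issue proposes) is to re-check that each path still exists immediately before triggering the read; a minimal sketch follows, where readExistingParquet is a hypothetical helper built on the standard Hadoop FileSystem API. It narrows the window but cannot close it, because files can still disappear between the check and task execution:

{code:scala}
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical helper: drop paths that no longer exist just before reading.
def readExistingParquet(sqlContext: SQLContext, paths: Seq[String]): DataFrame = {
  val conf = sqlContext.sparkContext.hadoopConfiguration
  val existing = paths.filter { p =>
    val path = new Path(p)
    path.getFileSystem(conf).exists(path)
  }
  sqlContext.read.parquet(existing: _*)
}
{code}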
> Maybe this could be avoided if sqlContext.read.parquet could receive a function that discovers the files instead. Like this: sqlContext.read.parquet( () => files )
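A hypothetical wrapper approximating that proposal; parquetLazy is not a real Spark API. The discovery function runs only when the wrapper is invoked, so each invocation reads a fresh listing, though Spark would still resolve the files at that point rather than at job execution:

{code:scala}
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical: defer file discovery to call time by taking a function.
def parquetLazy(sqlContext: SQLContext)(discover: () => Seq[String]): DataFrame =
  sqlContext.read.parquet(discover(): _*)

// Usage, assuming a caller-supplied listFiles that enumerates current files:
// val df = parquetLazy(sqlContext)(() => listFiles("hdfs://mynamenode.com:8020/rec/prefs"))
{code}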