Date: Tue, 19 May 2015 08:15:59 +0000 (UTC)
From: "Cheng Lian (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Commented] (SPARK-6533) Allow using wildcard and other file pattern in Parquet DataSource

[ https://issues.apache.org/jira/browse/SPARK-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550018#comment-14550018 ]

Cheng Lian commented on SPARK-6533:
-----------------------------------

This ticket duplicates SPARK-3928, which has been fixed by PR #5526 (https://github.com/apache/spark/pull/5526). I'm resolving this.
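The ticket is resolved by making Spark expand the pattern itself rather than treating it as a literal path. The actual change lives in PR #5526; purely as an illustrative sketch (not the patch itself), glob expansion against a Hadoop filesystem can be done with the public `FileSystem.globStatus` API, which understands `*`, `?`, and `[abc]` patterns. The object and method names below are made up for the example:

```scala
// Sketch only: resolving a glob pattern the way a data source could,
// via org.apache.hadoop.fs.FileSystem.globStatus. This is NOT the code
// from PR #5526, just the standard Hadoop API for the same job.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object GlobSketch {
  def resolveGlob(pattern: String): Seq[Path] = {
    // Path tolerates glob characters, unlike java.net.URI.create
    val path = new Path(pattern)
    // Pick the right filesystem (HDFS, local, ...) from the URI scheme
    val fs: FileSystem = path.getFileSystem(new Configuration())
    // globStatus expands *, ?, and [abc]; it returns null when nothing matches
    Option(fs.globStatus(path)).toSeq.flatten.map(_.getPath)
  }
}
```

Each returned `Path` is a concrete file or directory that can then be handed to the normal (non-glob) loading code.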
> Allow using wildcard and other file pattern in Parquet DataSource
> -----------------------------------------------------------------
>
>                 Key: SPARK-6533
>                 URL: https://issues.apache.org/jira/browse/SPARK-6533
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.3.0, 1.3.1
>            Reporter: Jianshi Huang
>            Priority: Critical
>
> By default, spark.sql.parquet.useDataSourceApi is set to true, and loading Parquet files with a file pattern then throws errors.
>
> *\*Wildcard*
> {noformat}
> scala> val qp = sqlContext.parquetFile("hdfs://.../source=live/date=2014-06-0*")
> 15/03/25 08:43:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 15/03/25 08:43:59 WARN hdfs.BlockReaderLocal: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
> java.io.FileNotFoundException: File does not exist: hdfs://.../source=live/date=2014-06-0*
>   at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1128)
>   at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
>   at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
>   at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:276)
>   at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267)
>   at org.apache.spark.sql.parquet.ParquetRelation2.<init>(newParquet.scala:388)
>   at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522)
> {noformat}
>
> And
>
> *\[abc\]*
> {noformat}
> val qp = sqlContext.parquetFile("hdfs://.../source=live/date=2014-06-0[12]")
> java.lang.IllegalArgumentException: Illegal character in path at index 74: hdfs://.../source=live/date=2014-06-0[12]
>   at java.net.URI.create(URI.java:859)
>   at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:268)
>   at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267)
>   at org.apache.spark.sql.parquet.ParquetRelation2.<init>(newParquet.scala:388)
>   at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522)
>   ... 49 elided
> Caused by: java.net.URISyntaxException: Illegal character in path at index 74: hdfs://.../source=live/date=2014-06-0[12]
>   at java.net.URI$Parser.fail(URI.java:2829)
>   at java.net.URI$Parser.checkChars(URI.java:3002)
>   at java.net.URI$Parser.parseHierarchical(URI.java:3086)
>   at java.net.URI$Parser.parse(URI.java:3034)
>   at java.net.URI.<init>(URI.java:595)
>   at java.net.URI.create(URI.java:857)
> {noformat}
>
> If spark.sql.parquet.useDataSourceApi is disabled we cannot have partition discovery, schema evolution, etc., but being able to specify a file pattern is also very important to applications.
>
> Please add this important feature.
>
> Jianshi

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org
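The `Caused by:` trace in the second snippet is independent of Spark and HDFS: `java.net.URI` treats `[` as illegal anywhere in a path (it is only legal in an IPv6 host literal), so any glob with a character class fails as soon as the pattern string is round-tripped through `URI.create`. A minimal stand-alone reproduction, with a made-up `namenode` host:

```scala
// Stand-alone reproduction of the URISyntaxException from the report:
// java.net.URI rejects '[' in a path, so a [12]-style glob never even
// reaches the filesystem layer. The "namenode" host is illustrative.
import java.net.URI

object UriGlobRepro extends App {
  val glob = "hdfs://namenode/source=live/date=2014-06-0[12]"

  val rejected =
    try { URI.create(glob); false }
    catch { case _: IllegalArgumentException => true } // wraps URISyntaxException

  println(s"URI.create rejected the glob: $rejected") // prints "URI.create rejected the glob: true"
}
```

This is why `org.apache.hadoop.fs.Path`, which accepts glob characters, has to be used instead of `URI` when a path string may contain a pattern.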