Date: Tue, 19 May 2015 08:15:59 +0000 (UTC)
From: "Cheng Lian (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Commented] (SPARK-6533) Allow using wildcard and other file pattern in Parquet DataSource

[ https://issues.apache.org/jira/browse/SPARK-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550018#comment-14550018 ]

Cheng Lian commented on SPARK-6533:
-----------------------------------

This ticket duplicates SPARK-3928, which has been fixed by PR #5526 (https://github.com/apache/spark/pull/5526). I'm resolving this.
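The ticket is resolved by making Spark expand the pattern itself rather than treating it as a literal path. The actual change lives in PR #5526; purely as an illustrative sketch (not the patch itself), glob expansion against a Hadoop filesystem can be done with the public `FileSystem.globStatus` API, which understands `*`, `?`, and `[abc]` patterns. The object and method names below are made up for the example:

```scala
// Sketch only: resolving a glob pattern the way a data source could,
// via org.apache.hadoop.fs.FileSystem.globStatus. This is NOT the code
// from PR #5526, just the standard Hadoop API for the same job.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object GlobSketch {
  def resolveGlob(pattern: String): Seq[Path] = {
    // Path tolerates glob characters, unlike java.net.URI.create
    val path = new Path(pattern)
    // Pick the right filesystem (HDFS, local, ...) from the URI scheme
    val fs: FileSystem = path.getFileSystem(new Configuration())
    // globStatus expands *, ?, and [abc]; it returns null when nothing matches
    Option(fs.globStatus(path)).toSeq.flatten.map(_.getPath)
  }
}
```

Each returned `Path` is a concrete file or directory that can then be handed to the normal (non-glob) loading code.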
> Allow using wildcard and other file pattern in Parquet DataSource
> -----------------------------------------------------------------
>
>                 Key: SPARK-6533
>                 URL: https://issues.apache.org/jira/browse/SPARK-6533
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.3.0, 1.3.1
>            Reporter: Jianshi Huang
>            Priority: Critical
>
> By default, spark.sql.parquet.useDataSourceApi is set to true, and loading Parquet files with a file pattern then throws errors.
>
> *\*Wildcard*
> {noformat}
> scala> val qp = sqlContext.parquetFile("hdfs://.../source=live/date=2014-06-0*")
> 15/03/25 08:43:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 15/03/25 08:43:59 WARN hdfs.BlockReaderLocal: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
> java.io.FileNotFoundException: File does not exist: hdfs://.../source=live/date=2014-06-0*
>   at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1128)
>   at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
>   at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
>   at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:276)
>   at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267)
>   at org.apache.spark.sql.parquet.ParquetRelation2.<init>(newParquet.scala:388)
>   at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522)
> {noformat}
>
> And
>
> *\[abc\]*
> {noformat}
> val qp = sqlContext.parquetFile("hdfs://.../source=live/date=2014-06-0[12]")
> java.lang.IllegalArgumentException: Illegal character in path at index 74: hdfs://.../source=live/date=2014-06-0[12]
>   at java.net.URI.create(URI.java:859)
>   at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:268)
>   at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267)
>   at org.apache.spark.sql.parquet.ParquetRelation2.<init>(newParquet.scala:388)
>   at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522)
>   ... 49 elided
> Caused by: java.net.URISyntaxException: Illegal character in path at index 74: hdfs://.../source=live/date=2014-06-0[12]
>   at java.net.URI$Parser.fail(URI.java:2829)
>   at java.net.URI$Parser.checkChars(URI.java:3002)
>   at java.net.URI$Parser.parseHierarchical(URI.java:3086)
>   at java.net.URI$Parser.parse(URI.java:3034)
>   at java.net.URI.<init>(URI.java:595)
>   at java.net.URI.create(URI.java:857)
> {noformat}
>
> If spark.sql.parquet.useDataSourceApi is disabled we cannot have partition discovery, schema evolution, etc., but being able to specify a file pattern is also very important to applications.
>
> Please add this important feature.
>
> Jianshi

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org
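The `Caused by:` trace in the second snippet is independent of Spark and HDFS: `java.net.URI` treats `[` as illegal anywhere in a path (it is only legal in an IPv6 host literal), so any glob with a character class fails as soon as the pattern string is round-tripped through `URI.create`. A minimal stand-alone reproduction, with a made-up `namenode` host:

```scala
// Stand-alone reproduction of the URISyntaxException from the report:
// java.net.URI rejects '[' in a path, so a [12]-style glob never even
// reaches the filesystem layer. The "namenode" host is illustrative.
import java.net.URI

object UriGlobRepro extends App {
  val glob = "hdfs://namenode/source=live/date=2014-06-0[12]"

  val rejected =
    try { URI.create(glob); false }
    catch { case _: IllegalArgumentException => true } // wraps URISyntaxException

  println(s"URI.create rejected the glob: $rejected") // prints "URI.create rejected the glob: true"
}
```

This is why `org.apache.hadoop.fs.Path`, which accepts glob characters, has to be used instead of `URI` when a path string may contain a pattern.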