spark-issues mailing list archives

From "Apache Spark (JIRA)" <>
Subject [jira] [Assigned] (SPARK-10185) Spark SQL does not handle comma separated paths on Hadoop FileSystem
Date Tue, 25 Aug 2015 13:28:46 GMT


Apache Spark reassigned SPARK-10185:

    Assignee:     (was: Apache Spark)

> Spark SQL does not handle comma separated paths on Hadoop FileSystem
> --------------------------------------------------------------------
>                 Key: SPARK-10185
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.1
>            Reporter: koert kuipers
> Spark SQL uses a Map[String, String] for data source settings. As a consequence, the only
way to pass in multiple paths (something that Hadoop file input formats support) is to
pass in a comma separated list. For example:
> sqlContext.read.format("json").load("dir1,dir2")
> or
> sqlContext.read.format("json").option("path", "dir1,dir2").load
> However, in this case ResolvedDataSource does not handle the comma delimited paths correctly
for a HadoopFsRelationProvider. It treats the comma delimited paths as a single path.
> For example, if I pass in "dir1,dir2" as the path, it will make dir1 qualified but ignore
dir2 (presumably because it simply treats it as part of dir1). If globs are involved, it
always returns an empty array of paths, because a glob with a comma in it doesn't match
anything.
> I think it's important to handle commas so multiple paths can be passed in, since the framework
does not provide an alternative. In some cases, like parquet, the code simply bypasses ResolvedDataSource
to support multiple paths, but to me this is a workaround that should be discouraged.
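The behavior the reporter is asking for amounts to splitting the "path" option on commas before qualifying each entry, the way Hadoop's FileInputFormat does for its input-path string. A minimal sketch of that splitting step (the helper name and trimming rules are illustrative assumptions, not Spark's actual code):

```python
def split_paths(path_option: str) -> list[str]:
    """Split a comma separated "path" option into individual paths.

    Each entry (which may itself be a glob such as "dir1/*.json") must be
    qualified and resolved separately; treating the whole string as one
    path makes the glob match nothing, as described in the report.
    """
    return [p.strip() for p in path_option.split(",") if p.strip()]
```

With this split applied first, "dir1,dir2" yields two paths to qualify instead of one malformed path, and "dir1/*.json,dir2/*.json" yields two globs that can each be expanded.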

This message was sent by Atlassian JIRA

