spark-issues mailing list archives

From "Apache Spark (JIRA)" <>
Subject [jira] [Commented] (SPARK-20622) Parquet partition discovery for non key=value named directories
Date Sat, 06 May 2017 15:52:04 GMT


Apache Spark commented on SPARK-20622:

User 'prokod' has created a pull request for this issue:

> Parquet partition discovery for non key=value named directories
> ---------------------------------------------------------------
>                 Key: SPARK-20622
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Noam Asor
> h4. Why
> There are cases where traditional M/R jobs and RDD-based Spark jobs write out partitioned
parquet in 'value only' named directories, e.g. {{hdfs:///some/base/path/2017/05/06}}, rather than
in 'key=value' named directories, e.g. {{hdfs:///some/base/path/year=2017/month=05/day=06}}.
This prevents users from leveraging Spark SQL parquet partition discovery when reading the
former back.
> h4. What
> This issue is a proposal for a solution which will allow Spark SQL to discover parquet
partitions for 'value only' named directories.
> h4. How
> By introducing a new Spark SQL read option *partitionTemplate*.
> *partitionTemplate* is given in path form: the base path followed by the missing 'key='
segments, which serve as a template for transforming 'value only' named dirs into 'key=value'
named dirs. For the example above it would look like:
> {{hdfs:///some/base/path/year=/month=/day=/}}.
> To simplify the solution, this option should be tied to the *basePath* option, meaning that
*partitionTemplate* is valid only if *basePath* is also set.
> Put together, for the above scenario this would look something like:
> {code}
>   .option("basePath", "hdfs:///some/base/path")
>   .option("partitionTemplate", "hdfs:///some/base/path/year=/month=/day=/")
>   .parquet(...)
> {code}
> which will allow Spark SQL to do parquet partition discovery on the following directory layout:
> {code}
> some
>   |--base
>        |--path
>              |--2016
>                   |--...
>              |--2017
>                   |--01
>                   |--02
>                        |--...
>                        |--15
>                        |--...
>                   |--...
> {code}
> adding to the schema of the resulting DataFrame the columns year, month, day and their
respective values as expected.
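
The path rewriting implied by *partitionTemplate* could be sketched roughly as below. This is only an illustration of the idea in plain Scala, not the code from the pull request and not an existing Spark API: the object `PartitionTemplateSketch` and its methods `templateKeys` and `toKeyValuePath` are hypothetical names introduced here, and the sketch assumes the template is simply the base path followed by one 'key=' segment per directory level.

```scala
// Hypothetical helper for illustration only; nothing here is part of Spark.
object PartitionTemplateSketch {

  /** Extracts the partition keys from the part of the template below basePath. */
  def templateKeys(basePath: String, template: String): Seq[String] = {
    require(template.startsWith(basePath), "template must extend basePath")
    template.stripPrefix(basePath)
      .split("/")
      .filter(_.nonEmpty)
      .map { segment =>
        // Assumption: every template segment has the form 'key='.
        require(segment.endsWith("="), s"expected 'key=' but found '$segment'")
        segment.stripSuffix("=")
      }
      .toSeq
  }

  /** Rewrites a 'value only' path below basePath into 'key=value' form. */
  def toKeyValuePath(basePath: String, template: String, path: String): String = {
    require(path.startsWith(basePath), "path must be under basePath")
    val keys = templateKeys(basePath, template)
    val segments = path.stripPrefix(basePath).split("/").filter(_.nonEmpty)
    // Only the leading segments covered by the template are rewritten;
    // anything deeper (for example a file name) is kept unchanged.
    val rewritten = segments.zipWithIndex.map { case (segment, i) =>
      if (i < keys.length) s"${keys(i)}=$segment" else segment
    }
    (basePath.stripSuffix("/") +: rewritten).mkString("/")
  }
}
```

With the values from the description, `toKeyValuePath("hdfs:///some/base/path", "hdfs:///some/base/path/year=/month=/day=/", "hdfs:///some/base/path/2017/05/06")` yields a path in the 'key=value' form that the existing partition discovery already understands.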

This message was sent by Atlassian JIRA
