Mailing-List: contact issues-help@spark.apache.org; run by ezmlm
Precedence: bulk
Date: Tue, 13 Jun 2017 13:02:00 +0000 (UTC)
From: "Michel Lemay (JIRA)" <jira@apache.org>
To: issues@spark.apache.org
Message-ID: <JIRA.13078301.1496924798000.9403.1497358920136@Atlassian.JIRA>
In-Reply-To: <JIRA.13078301.1496924798000@Atlassian.JIRA>
References: <JIRA.13078301.1496924798000@Atlassian.JIRA> <JIRA.13078301.1496924798925@jira-lw-us.apache.org>
Subject: [jira] [Comment Edited] (SPARK-21021) Reading partitioned parquet
 does not respect specified schema column order
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Tue, 13 Jun 2017 13:02:06 -0000


    [ https://issues.apache.org/jira/browse/SPARK-21021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16047849#comment-16047849 ] 

Michel Lemay edited comment on SPARK-21021 at 6/13/17 1:01 PM:
---------------------------------------------------------------

Yes, as a workaround, we reorder the columns explicitly after reading:
{code}
df.select(schema.fieldNames.head, schema.fieldNames.tail: _*)
{code}


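However, I think the reader should respect the column order of the specified schema even when the data is partitioned.

Something like `dataSchema ++ (partitionSchema - dataSchema)`: the data schema's fields first, followed only by the partition columns it does not already contain.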
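
The ordering proposed above could be sketched roughly as follows. This is only an illustration, not Spark's actual reader logic: `mergeSchemas` is a hypothetical helper, and it assumes field names are compared case-sensitively.

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical helper (not part of Spark): keep the data schema's order and
// append only the partition columns that the data schema does not already list.
// Assumption: names are compared case-sensitively.
def mergeSchemas(dataSchema: StructType, partitionSchema: StructType): StructType = {
  val dataNames = dataSchema.fieldNames.toSet
  val extraPartitionFields =
    partitionSchema.fields.filterNot(f => dataNames.contains(f.name))
  StructType(dataSchema.fields ++ extraPartitionFields)
}

// The schema from the issue, which already names the partition columns f1 and f2.
val userSchema = StructType(Seq(
  StructField("f1", StringType, true),
  StructField("f2", StringType, true),
  StructField("f3", StringType, true)))
val partitionSchema = StructType(Seq(
  StructField("f1", StringType, true),
  StructField("f2", StringType, true)))

val merged = mergeSchemas(userSchema, partitionSchema)
```

With the schema from the issue, `merged.fieldNames` keeps the requested order f1, f2, f3, because both partition columns are already present in the specified schema.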



> Reading partitioned parquet does not respect specified schema column order
> --------------------------------------------------------------------------
>
>                 Key: SPARK-21021
>                 URL: https://issues.apache.org/jira/browse/SPARK-21021
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Michel Lemay
>            Priority: Minor
>
> When reading back a partitioned Parquet folder with an explicit schema, the resulting column order does not match the schema: the partition columns are moved to the end.
> Consider the following example:
> {code}
> import org.apache.spark.sql.types._
> case class Event(f1: String, f2: String, f3: String)
> val df = Seq(Event("v1", "v2", "v3")).toDF
> df.write.partitionBy("f1", "f2").parquet("out")
> // Requested column order: f1, f2, f3
> val schema: StructType = StructType(
>   StructField("f1", StringType, true) ::
>   StructField("f2", StringType, true) ::
>   StructField("f3", StringType, true) :: Nil)
> val dfRead = spark.read.schema(schema).parquet("out")
> dfRead.show
> +---+---+---+
> | f3| f1| f2|
> +---+---+---+
> | v3| v1| v2|
> +---+---+---+
> dfRead.columns
> Array[String] = Array(f3, f1, f2)
> schema.fields
> Array(StructField(f1,StringType,true), StructField(f2,StringType,true), StructField(f3,StringType,true))
> {code}
> This makes it hard to keep schemas compatible when reading from multiple sources.
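
To illustrate why the order matters across sources: `Dataset.union` matches columns by position, not by name, so two DataFrames with the same fields in a different order get combined incorrectly. A minimal sketch of aligning both sides first, assuming every field in the schema exists in each DataFrame (`inSchemaOrder` and `unionInSchemaOrder` are hypothetical helper names, not Spark APIs):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Hypothetical helper: project `df` onto the column order given by `schema`.
// Assumption: every field name in `schema` is a top-level column of `df`
// and contains no dots (col() would treat a dot as nested-field access).
def inSchemaOrder(df: DataFrame, schema: StructType): DataFrame =
  df.select(schema.fieldNames.map(col): _*)

// Hypothetical helper: align both inputs to `schema` before the positional union.
def unionInSchemaOrder(a: DataFrame, b: DataFrame, schema: StructType): DataFrame =
  inSchemaOrder(a, schema).union(inSchemaOrder(b, schema))
```

For example, `unionInSchemaOrder(dfRead, otherDf, schema)` would combine the partitioned read with a second source (here a hypothetical `otherDf`) using the f1, f2, f3 order of the schema.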

