spark-issues mailing list archives

From "Li Jin (JIRA)" <>
Subject [jira] [Commented] (SPARK-20144) no long maintains ordering of the data
Date Fri, 31 Mar 2017 15:03:41 GMT


Li Jin commented on SPARK-20144:

Also, I am not sure about the claim "If the data were sorted, sorting would be pretty cheap".
Can you explain this in more detail?

> no long maintains ordering of the data
> ---------------------------------------------------------
>                 Key: SPARK-20144
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: Li Jin
> Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is that when we
> read parquet files in 2.0.2, the ordering of rows in the resulting dataframe is not the same
> as the ordering of rows in the dataframe that the parquet file was produced from.
> This is because FileSourceStrategy.scala combines the parquet files into fewer partitions
> and also reorders them. This breaks our workflows because they assume the ordering of the
> Is this considered a bug? Also, FileSourceStrategy and FileSourceScanExec changed quite
> a bit from 2.0.2 to 2.1, so I am not sure whether this is also an issue in 2.1.
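
For illustration, here is a minimal sketch of how combining files into fewer partitions can change the read order. It is not Spark's code. It assumes a simplified greedy packing that sorts files by size, largest first, before filling partitions. The file names, sizes, and the max_split_bytes value are hypothetical.

```python
from typing import List, Tuple

# (hypothetical file name, hypothetical size in bytes)
FileSplit = Tuple[str, int]


def pack_splits(files: List[FileSplit], max_split_bytes: int) -> List[List[str]]:
    """Greedily pack files into read partitions, largest file first.

    Simplified model for illustration only (assumption, not Spark's code).
    """
    # Sorting by size discards the order in which the files were written.
    by_size = sorted(files, key=lambda f: f[1], reverse=True)

    partitions: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in by_size:
        # Close the current partition when the next file would overflow it.
        if current and current_size + size > max_split_bytes:
            partitions.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        partitions.append(current)
    return partitions


# Files listed in the order they were written.
files = [
    ("part-00000", 10),
    ("part-00001", 40),
    ("part-00002", 20),
    ("part-00003", 30),
]

partitions = pack_splits(files, max_split_bytes=50)
read_order = [name for part in partitions for name in part]
written_order = [name for name, _ in files]

print(partitions)
print(read_order == written_order)
```

In this model, four files end up in three partitions and are read in an order that differs from the order in which they were written. A workflow that needs a particular row order would have to store an explicit ordering column and sort on it after reading, rather than rely on file order.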
