spark-dev mailing list archives

From Dongjoon Hyun <dongjoon.h...@gmail.com>
Subject Re: Schema Evolution in Apache Spark
Date Fri, 12 Jan 2018 14:39:40 GMT
This is about Spark-layer test cases on **read-only** CSV, JSON, Parquet,
and ORC files. You can find more details and comparisons in terms of Spark
support coverage.

Bests,
Dongjoon.


On Thu, Jan 11, 2018 at 22:19 Georg Heiler <georg.kf.heiler@gmail.com>
wrote:

> Isn't this related to the data format used, i.e. parquet, Avro, ... which
> already support changing schema?
>
> On Fri, Jan 12, 2018 at 02:30, Dongjoon Hyun <dongjoon.hyun@gmail.com>
> wrote:
>
>> Hi, All.
>>
>> A data schema can evolve in several ways, and Apache Spark 2.3 already
>> supports the following changes for file-based data sources like
>> CSV/JSON/ORC/Parquet.
>>
>> 1. Add a column
>> 2. Remove a column
>> 3. Change a column position
>> 4. Change a column type
>>
>> Can we guarantee users some schema evolution coverage on file-based data
>> sources by adding schema evolution test suites explicitly? So far, there
>> are some test cases.
>>
>> For simplicity, I make several assumptions about schema evolution.
>>
>> 1. A safe evolution without data loss.
>>     - e.g. from small types to larger types like int-to-long, not vice
>> versa.
>> 2. The final schema is given by users (or Hive).
>> 3. Simple Spark data types supported by Spark vectorized execution.
>>
>> I opened a test case PR to get your opinions on this.
>>
>> [SPARK-23007][SQL][TEST] Add schema evolution test suite for file-based
>> data sources
>> - https://github.com/apache/spark/pull/20208
>>
>> Could you take a look and give some opinions?
>>
>> Bests,
>> Dongjoon.
>>
>
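
For illustration, the four kinds of change and the "safe evolution"
assumption quoted above could be modeled roughly as follows. This is only a
toy sketch in plain Python, not Spark's implementation or API: the names
`SAFE_WIDENING`, `is_safe_change`, and `read_with_schema` are hypothetical,
columns are assumed to be matched by name, and the widening table contains
only the int-to-long case named in the thread.

```python
# Toy model only, NOT Spark's implementation or API.
# A schema is a list of (column_name, type_name) pairs.
# A row is a tuple whose values follow the order of the file schema.

# Hypothetical widening table: the thread only names int-to-long.
SAFE_WIDENING = {("int", "long")}


def is_safe_change(file_type, final_type):
    """Return True if reading file_type as final_type cannot lose data."""
    return file_type == final_type or (file_type, final_type) in SAFE_WIDENING


def read_with_schema(rows, file_schema, final_schema):
    """Project rows written with file_schema onto a user-given final_schema.

    Assumption: columns are matched by name.
    - added column    -> filled with None
    - removed column  -> dropped
    - moved column    -> reordered to its position in final_schema
    - changed type    -> accepted only if is_safe_change() allows it
    """
    file_pos = {name: i for i, (name, _) in enumerate(file_schema)}
    file_types = dict(file_schema)

    # Reject unsafe type changes before touching any data.
    for name, final_type in final_schema:
        if name in file_types and not is_safe_change(file_types[name], final_type):
            raise TypeError(
                f"unsafe type change for column {name!r}: "
                f"{file_types[name]} -> {final_type}"
            )

    return [
        tuple(row[file_pos[name]] if name in file_pos else None
              for name, _ in final_schema)
        for row in rows
    ]


# Data as written in the file.
file_schema = [("id", "int"), ("name", "string"), ("legacy", "string")]
rows = [(1, "a", "x"), (2, "b", "y")]

# Final schema given by the user: "legacy" removed, "score" added,
# "id" and "name" swapped, and "id" widened from int to long.
final_schema = [("name", "string"), ("id", "long"), ("score", "double")]

evolved = read_with_schema(rows, file_schema, final_schema)
```

Here the final schema is the single source of truth, which mirrors the
second assumption above; a long-to-int change would raise `TypeError`
instead of silently truncating values.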
