From "Ethan Guo (Jira)" <j...@apache.org>
Subject [jira] [Updated] (HUDI-552) Fix the schema mismatch in Row-to-Avro conversion
Date Sun, 19 Jan 2020 07:42:00 GMT

     [ https://issues.apache.org/jira/browse/HUDI-552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-552:
---------------------------
    Description: 
When using the `FilebasedSchemaProvider` to provide the source schema in Avro while ingesting
data from `ParquetDFSSource` with the same schema, the DeltaStreamer fails.  A new test case,
added below, demonstrates the error:

!Screen Shot 2020-01-18 at 12.12.58 AM.png|width=543,height=392!

!Screen Shot 2020-01-18 at 12.13.08 AM.png|width=546,height=165!

Based on further investigation, the root cause is that when writing parquet files, Spark
automatically [converts all fields to nullable|https://spark.apache.org/docs/latest/sql-data-sources-parquet.html]
for compatibility reasons.  Even if the source Avro schema has non-null fields, `AvroConversionUtils.createRdd`
still uses the `dataType` from the DataFrame to convert each Row to an Avro record.  That `dataType`
has nullable fields per Spark's logic, even though the field names are identical to those in the
source Avro schema.  Thus the Avro records produced by the conversion have a schema that differs
from the source schema file (only in nullability).  Before the records are inserted, other
operations that use the source schema file fail to serialize/deserialize them because of this
schema mismatch.
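
To make the nullability coercion concrete, here is a minimal, self-contained sketch of the
parquet round trip described above; the local session setup and `/tmp` output path are
illustrative assumptions, and this is plain Spark, not Hudi code:

```scala
// Minimal sketch: a non-nullable field becomes nullable after a parquet round trip.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object NullabilityRoundTrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("nullability-round-trip")
      .getOrCreate()

    // "id" is declared non-nullable, mirroring a required field in the source Avro schema.
    val schema = StructType(Seq(StructField("id", LongType, nullable = false)))
    val df = spark.createDataFrame(
      spark.sparkContext.parallelize(Seq(Row(1L), Row(2L))), schema)
    df.printSchema()  // |-- id: long (nullable = false)

    df.write.mode("overwrite").parquet("/tmp/nullability-demo")

    // Spark marks every parquet column nullable for compatibility, so the
    // dataType of the re-read DataFrame no longer matches the source schema.
    spark.read.parquet("/tmp/nullability-demo").printSchema()
    // |-- id: long (nullable = true)

    spark.stop()
  }
}
```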

 

The following screenshots show the original source schema file and the modified Avro schema
observed in `AvroConversionUtils.createRdd`.  The original source schema file is:

!Screen Shot 2020-01-18 at 12.31.23 AM.png|width=844,height=349!

 

The modified Avro schema in `AvroConversionUtils.createRdd` is:

!Screen Shot 2020-01-18 at 12.15.09 AM.png|width=850,height=471!

 

Note that for some Avro schemas, the DeltaStreamer sync may succeed but generate corrupt data.
This behavior of generating corrupt data was originally reported by [~liujinhui].
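
The schema drift can also be demonstrated directly at the Avro level.  The sketch below
(assuming the `SchemaConverters` utility from Spark 2.4's built-in spark-avro module) derives
an Avro schema from the nullable `dataType` and compares it with the source schema; the field
names match, but the derived field becomes a nullable union, which is exactly the mismatch
that later breaks serialization/deserialization.  This illustrates the bug and is not the
HUDI-552 patch itself:

```scala
// Sketch of the schema drift at the Avro level.  Assumes Spark 2.4's built-in
// spark-avro module (org.apache.spark.sql.avro.SchemaConverters).
import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object SchemaDrift {
  def main(args: Array[String]): Unit = {
    // The source schema file declares "id" as a required (non-null) long.
    val sourceSchema = new Schema.Parser().parse(
      """{"type": "record", "name": "rec", "namespace": "demo",
        |  "fields": [{"name": "id", "type": "long"}]}""".stripMargin)

    // The dataType Spark reports after the parquet round trip: same field
    // name, but now nullable.
    val dfType = StructType(Seq(StructField("id", LongType, nullable = true)))

    // The derived Avro schema turns "id" into the union ["long", "null"], so
    // it is no longer equal to the source schema, and records serialized with
    // one schema fail to deserialize with the other.
    val derived = SchemaConverters.toAvroType(dfType, nullable = false,
      recordName = "rec", nameSpace = "demo")
    println(derived.toString(true))
    println(s"matches source schema? ${derived == sourceSchema}")  // false
  }
}
```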


> Fix the schema mismatch in Row-to-Avro conversion
> -------------------------------------------------
>
>                 Key: HUDI-552
>                 URL: https://issues.apache.org/jira/browse/HUDI-552
>             Project: Apache Hudi (incubating)
>          Issue Type: Sub-task
>          Components: Spark Integration
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.5.1
>
>         Attachments: Screen Shot 2020-01-18 at 12.12.58 AM.png, Screen Shot 2020-01-18 at 12.13.08 AM.png, Screen Shot 2020-01-18 at 12.15.09 AM.png, Screen Shot 2020-01-18 at 12.31.23 AM.png
>
>          Time Spent: 20m
>  Remaining Estimate: 0h



