hudi-commits mailing list archives

From "Yixue (Andrew) Zhu (Jira)" <j...@apache.org>
Subject [jira] [Comment Edited] (HUDI-603) HoodieDeltaStreamer should periodically fetch table schema update
Date Sun, 23 Feb 2020 18:37:00 GMT

    [ https://issues.apache.org/jira/browse/HUDI-603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17043016#comment-17043016 ]

Yixue (Andrew) Zhu edited comment on HUDI-603 at 2/23/20 6:36 PM:
------------------------------------------------------------------

I am still reading through the Hudi code base, but I think the following approach could work:
 # Introduce a SchemaProvider subclass that retrieves the latest schema, when needed, from the
Confluent Schema Registry.
 # Enhance AvroSource (or another Source subclass) to record the Avro schema id used for serialization,
as the Confluent Schema Registry does. When records are deserialized from Kafka, or for compaction, translate
them to the refreshed schema, which HoodieWriteHandle (or a derived class) snapshots from the
SchemaProvider. Translation can be skipped when the schema ids match.
 # Register a custom serializer for GenericRecord in Spark that uses the schema id.
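
A minimal sketch of steps 1 and 2, under stated assumptions: the class names VersionedSchema and RefreshingSchemaProvider are hypothetical and are not part of Hudi or the Confluent client, the schema is held as a plain string instead of an Avro Schema, and the registry lookup is abstracted as a Supplier so that no real registry API is implied. It only illustrates re-fetching the latest schema and the "skip translation if the ids match" shortcut.

```java
import java.util.Objects;
import java.util.function.Supplier;

/** Hypothetical holder: a registry schema id paired with the schema text. */
final class VersionedSchema {
    final int id;
    final String schemaJson;

    VersionedSchema(int id, String schemaJson) {
        this.id = id;
        this.schemaJson = Objects.requireNonNull(schemaJson);
    }
}

/** Hypothetical provider that re-fetches the latest schema on demand. */
final class RefreshingSchemaProvider {
    // Stand-in for a schema registry lookup (assumption, not a real client API).
    private final Supplier<VersionedSchema> latestSchemaFetcher;
    private VersionedSchema current;

    RefreshingSchemaProvider(Supplier<VersionedSchema> latestSchemaFetcher) {
        this.latestSchemaFetcher = Objects.requireNonNull(latestSchemaFetcher);
        this.current = latestSchemaFetcher.get();
    }

    /** Re-reads the latest schema; returns true if the schema id changed. */
    synchronized boolean refresh() {
        VersionedSchema latest = latestSchemaFetcher.get();
        boolean changed = latest.id != current.id;
        current = latest;
        return changed;
    }

    /** The schema snapshot a writer would pick up at this moment. */
    synchronized VersionedSchema current() {
        return current;
    }

    /** Shortcut from step 2: a record written with the current id needs no translation. */
    synchronized boolean needsTranslation(int writerSchemaId) {
        return writerSchemaId != current.id;
    }
}
```

In a sync loop, refresh() could be called at the start of each round, and only records for which needsTranslation(...) returns true would go through schema resolution.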


was (Author: yx3zhu@gmail.com):
I am still reading through the Hudi code base, but I think the following approach could work:
 # Introduce a SchemaProvider subclass that retrieves the latest schema, when needed, from the
Confluent Schema Registry.
 # Enhance AvroSource (or another Source subclass) to record the Avro schema id used for serialization,
as the Confluent Schema Registry does. When records are deserialized for compaction, translate them to the
refreshed schema, which HoodieWriteHandle (or a derived class) snapshots from the SchemaProvider.
Translation can be skipped when the schema ids match.
 # Register a custom serializer for GenericRecord in Spark that uses the schema id.

> HoodieDeltaStreamer should periodically fetch table schema update
> -----------------------------------------------------------------
>
>                 Key: HUDI-603
>                 URL: https://issues.apache.org/jira/browse/HUDI-603
>             Project: Apache Hudi (incubating)
>          Issue Type: Bug
>          Components: DeltaStreamer
>            Reporter: Yixue Zhu
>            Priority: Major
>              Labels: evolution, schema
>
> HoodieDeltaStreamer creates a SchemaProvider instance and delegates to DeltaSync for periodic
sync. However, the default implementation of SchemaProvider does not refresh the schema, which can
change due to schema evolution. DeltaSync snapshots the schema when it creates the writeClient,
using the SchemaProvider instance or picking it up from the source, and the writeClient's schema is
not refreshed during the sync loop.
> I think this needs to be addressed to support schema evolution fully.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
