hudi-commits mailing list archives

From "vinoyang (Jira)" <>
Subject [jira] [Created] (HUDI-613) Refactor and enhance the Transformer component
Date Fri, 14 Feb 2020 13:17:00 GMT
vinoyang created HUDI-613:

             Summary: Refactor and enhance the Transformer component
                 Key: HUDI-613
             Project: Apache Hudi (incubating)
          Issue Type: Bug
            Reporter: vinoyang

Hudi currently has a component that is not widely used: Transformer. Preprocessing and ETL
are very common operations applied to raw data before it lands in the data lake, and they are
among the most common use cases for computing engines such as Flink and Spark. Since Hudi
already builds on the power of a computing engine, it can also naturally take advantage of
that engine's data preprocessing abilities. We can refactor the Transformer to make it more
flexible. In summary, the refactoring covers the following aspects:

* Decouple Transformer from Spark
* Enrich the Transformer and provide built-in Transformers
* Support Transformer chains

For the first point, the Transformer interface is tightly coupled with Spark by design and
contains a Spark-specific context. Once Hudi supports multiple engines, this coupling will
prevent us from using the transform capabilities provided by other engines (such as Flink).
Therefore, we need to decouple the interface from Spark.
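
As a rough illustration of the decoupling idea, the sketch below parameterizes the interface
over the engine's dataset type instead of naming Spark classes. The generic parameter D, the
plain Map of properties, and UpperCaseTransformer are assumptions made for this example only;
they are not Hudi's actual API, and a plain Java List stands in for an engine dataset.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical engine-neutral contract. D is whatever dataset type the engine
// uses (a Spark Dataset, a Flink DataStream, ...). No engine class appears here.
interface Transformer<D> {
    D apply(D input, Map<String, String> props);
}

// Illustration only: a plain Java List plays the role of the engine dataset.
final class UpperCaseTransformer implements Transformer<List<String>> {
    @Override
    public List<String> apply(List<String> input, Map<String, String> props) {
        // Upper-case every record; props is unused in this tiny example.
        return input.stream().map(String::toUpperCase).collect(Collectors.toList());
    }
}
```

Under this assumption, an engine-specific module would bind D to its own dataset type, and
the shared interface would carry no engine dependency.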

For the second point, we can enhance the Transformer and provide some out-of-the-box Transformers,
such as FilterTransformer, FlatMapTransformer, and so on.
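
To make the idea of built-in Transformers concrete, here is a minimal sketch of what a filter
and a flat-map Transformer could look like. Only the two class names come from this issue; the
constructors, the apply methods, and the use of a plain Java List as the dataset are assumptions
for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical built-in: keeps only the records that satisfy the predicate.
final class FilterTransformer<T> {
    private final Predicate<T> keep;

    FilterTransformer(Predicate<T> keep) {
        this.keep = keep;
    }

    List<T> apply(List<T> input) {
        List<T> out = new ArrayList<>();
        for (T record : input) {
            if (keep.test(record)) {
                out.add(record);
            }
        }
        return out;
    }
}

// Hypothetical built-in: expands each record into zero or more output records.
final class FlatMapTransformer<I, O> {
    private final Function<I, List<O>> expand;

    FlatMapTransformer(Function<I, List<O>> expand) {
        this.expand = expand;
    }

    List<O> apply(List<I> input) {
        List<O> out = new ArrayList<>();
        for (I record : input) {
            out.addAll(expand.apply(record));
        }
        return out;
    }
}
```

For example, a FilterTransformer built with n -> n % 2 == 0 would keep only even numbers, and a
FlatMapTransformer built with a function that splits on spaces would turn lines into words.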

For the third point, the most common pattern for data processing is the pipeline model, and
a common implementation of the pipeline model is the chain-of-responsibility pattern, comparable
to Apache Commons Chain[1]. Combining multiple Transformers makes data processing more flexible
and extensible.
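
A Transformer chain could be sketched as a Transformer that applies a list of stages in order,
feeding each stage's output to the next. The interface shape and the name ChainedTransformer are
assumptions for this example, not an existing Hudi API, and the sketch shows simple sequential
composition rather than a full chain-of-responsibility framework.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical engine-neutral contract, repeated so the sketch is self-contained.
interface Transformer<D> {
    D apply(D input, Map<String, String> props);
}

// Hypothetical chain: itself a Transformer, so chains can be nested in other chains.
final class ChainedTransformer<D> implements Transformer<D> {
    private final List<Transformer<D>> stages;

    @SafeVarargs
    ChainedTransformer(Transformer<D>... stages) {
        this.stages = Arrays.asList(stages);
    }

    @Override
    public D apply(D input, Map<String, String> props) {
        D current = input;
        // Run the stages in declaration order; an empty chain returns the input unchanged.
        for (Transformer<D> stage : stages) {
            current = stage.apply(current, props);
        }
        return current;
    }
}
```

Because the chain implements the same interface as its stages, callers would not need to know
whether they hold a single Transformer or a whole pipeline.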

If we enhance the capabilities of the Transformer component, Hudi will provide richer data
processing capabilities on top of the computing engine.

The relevant discussion thread is here:
