hudi-commits mailing list archives

From "cdmikechen (Jira)" <j...@apache.org>
Subject [jira] [Commented] (HUDI-481) Support SQL-like method
Date Tue, 07 Jan 2020 02:58:00 GMT

    [ https://issues.apache.org/jira/browse/HUDI-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17009320#comment-17009320
] 

cdmikechen commented on HUDI-481:
---------------------------------

[~vinoth]
I checked the Spark project. It seems that the Spark SQL syntax tree currently supports only
the *DELETE* keyword; *UPDATE* and *MERGE* are not supported yet. I think this may be because
Spark is designed around dataset-to-dataset transformations. Existing operators can solve
similar problems, but the result is not SQL-like.
My current idea is to build a layer of SQL syntax on top of *hudi-core* and use antlr4 to
parse statements and process their semantics. For example, an update statement can be parsed
into two steps: first filter the data according to the WHERE conditions, then upsert the
data into Hudi.
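
To make the filter-then-upsert idea concrete, here is a minimal sketch in plain Scala. It uses in-memory collections as a stand-in for a Hudi table, and every name in it (`Table`, `UpdateStmt`, `execute`) is hypothetical: none of them are Hudi, Spark, or antlr4 APIs, and the parsing step is assumed to have already produced the `UpdateStmt` value.

```scala
// Hypothetical in-memory model; none of these names are Hudi or Spark APIs.
object UpdateAsUpsert {
  // A row is a simple column-name -> value map.
  type Row = Map[String, Any]

  // Stand-in for a Hudi table: rows indexed by their record key.
  final case class Table(keyField: String, rows: Map[Any, Row]) {
    // Upsert: insert rows with new keys, overwrite rows whose key already exists.
    def upsert(changed: Iterable[Row]): Table =
      copy(rows = rows ++ changed.map(r => r(keyField) -> r))
  }

  // Assumed parsed form of: UPDATE <table> SET <column> = <value> WHERE <predicate>
  final case class UpdateStmt(column: String, value: Any, where: Row => Boolean)

  def execute(table: Table, stmt: UpdateStmt): Table = {
    // Step 1: keep only the rows that satisfy the WHERE condition.
    val matching = table.rows.values.filter(stmt.where)
    // Step 2: apply the SET assignment to each matching row.
    val rewritten = matching.map(row => row.updated(stmt.column, stmt.value))
    // Step 3: upsert the rewritten rows back into the table.
    table.upsert(rewritten)
  }
}
```

Under these assumptions, `update table set col1 = X where col2 = Y` would correspond to `execute(table, UpdateStmt("col1", "X", row => row("col2") == "Y"))`; rows that do not match the predicate are left untouched.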

> Support SQL-like method
> -----------------------
>
>                 Key: HUDI-481
>                 URL: https://issues.apache.org/jira/browse/HUDI-481
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: CLI
>            Reporter: cdmikechen
>            Priority: Minor
>
> As we know, Hudi uses the Spark datasource API to upsert data. For example, if we want to
> update a record, we need to get the old row's data first and then use the upsert method to
> update this row.
> But there is another situation where someone just wants to update one column of data.
> Expressed in SQL, it is {{update table set col1 = X where col2 = Y}}. This is
> something Hudi cannot handle directly at present; we can only get all the data involved
> as a dataset first and then merge it.
> So I think maybe we can create a new subproject to process batch data in an SQL-like
> way. For example:
>  {code}
> val hudiTable = new HudiTable(path)
> hudiTable.update.set("col1 = X").where("col2 = Y")
> hudiTable.delete.where("col3 = Z")
> hudiTable.commit
> {code}
> It may also extend the functionality to support JDBC-like schemes, as in this RFC: [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+14+%3A+JDBC+incremental+puller]
> I hope everyone can offer some suggestions on whether this plan is feasible.
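
The fluent API proposed in the description could be sketched roughly as follows. This is only an illustration of the builder shape, assuming that operations are queued as unparsed strings and handed over on commit. `HudiTable`, `update`, `delete`, and `commit` follow the names in the proposal, while `Op`, `Update`, `Delete`, and the builder classes are hypothetical. Nothing here exists in Hudi, and no data is read or written.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical sketch of the proposed fluent API; not an existing Hudi class.
object FluentSketch {
  // Pending operations, kept as raw strings (no SQL parsing in this sketch).
  sealed trait Op
  final case class Update(set: String, where: String) extends Op
  final case class Delete(where: String) extends Op

  class HudiTable(val path: String) {
    private val pending = ListBuffer.empty[Op]

    class UpdateBuilder {
      def set(assignment: String): SetBuilder = new SetBuilder(assignment)
    }

    class SetBuilder(assignment: String) {
      // Completing the WHERE clause queues the update.
      def where(condition: String): HudiTable = {
        pending += Update(assignment, condition)
        HudiTable.this
      }
    }

    class DeleteBuilder {
      // Completing the WHERE clause queues the delete.
      def where(condition: String): HudiTable = {
        pending += Delete(condition)
        HudiTable.this
      }
    }

    def update: UpdateBuilder = new UpdateBuilder
    def delete: DeleteBuilder = new DeleteBuilder

    // Returns the queued operations in order and clears the queue.
    // A real implementation would translate them into Hudi writes here.
    def commit(): List[Op] = {
      val ops = pending.toList
      pending.clear()
      ops
    }
  }
}
```

Queuing the operations until `commit()` keeps the batch explicit, so several updates and deletes could in principle be applied in a single write.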



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
