hudi-dev mailing list archives

From 18717838093 <18717838...@126.com>
Subject Hive integration Improvment
Date Thu, 15 Jul 2021 12:15:08 GMT


Hi, experts.


Currently, in most cases Hudi executes its SQL statements for DML through the Hive Driver,
using SQL strings built by concatenation. Concatenated SQL is hard to maintain, and the code
breaks easily. In addition, multiple versions of Hive cannot be supported at the moment,
which causes a lot of headaches for users. I would therefore like to refactor these two
areas to get a better design that is more convenient for users.


For example, the following functions use the Driver to execute SQL:

HiveSyncTool#syncHoodieTable, which creates a database through the Driver.
HoodieHiveClient#createTable, which creates a table through the Driver.
HoodieHiveClient#addPartitionsToTable, which adds partitions through the Driver.
HoodieHiveClient#updatePartitionsToTable, which updates partitions through the Driver.
HoodieHiveClient#updateTableDefinition, which alters a table through the Driver.
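
The maintenance concern with concatenated SQL can be illustrated with a minimal sketch. This is not the Hudi code: PartitionDdl, concatAddPartition and PartitionRequest are hypothetical names invented for the example. It only shows how a hand-assembled statement is malformed as soon as a value contains a quote, while a typed request carries the same value unchanged.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from the Hudi code base.
final class PartitionDdl {
    // Driver-style approach: the statement is assembled by string concatenation.
    // Values are pasted in verbatim, so quoting and escaping are the caller's problem.
    static String concatAddPartition(String table, Map<String, String> spec, String location) {
        List<String> parts = new ArrayList<>();
        for (Map.Entry<String, String> e : spec.entrySet()) {
            parts.add("`" + e.getKey() + "`='" + e.getValue() + "'");
        }
        return "ALTER TABLE `" + table + "` ADD IF NOT EXISTS PARTITION ("
            + String.join(",", parts) + ") LOCATION '" + location + "'";
    }
}

// Client-style approach: the same information travels as a typed value,
// so no SQL text has to be built or escaped by hand.
final class PartitionRequest {
    final String table;
    final List<String> values;
    final String location;

    PartitionRequest(String table, List<String> values, String location) {
        this.table = table;
        this.values = Collections.unmodifiableList(new ArrayList<>(values));
        this.location = location;
    }
}

final class ConcatDemo {
    public static void main(String[] args) {
        Map<String, String> spec = new LinkedHashMap<>();
        spec.put("owner", "o'brien"); // a value containing a single quote

        // The concatenated statement now has unbalanced quotes.
        System.out.println(PartitionDdl.concatAddPartition("t", spec, "/p"));

        // The typed request keeps the value exactly as given.
        PartitionRequest request =
            new PartitionRequest("t", new ArrayList<>(spec.values()), "/p");
        System.out.println(request.values);
    }
}
```
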




In contrast, metadata operations such as HoodieHiveClient#updateTableProperties,
HoodieHiveClient#scanTablePartitions, and HoodieHiveClient#doesTableExist use the client API
instead. From a design perspective, the two approaches are not aligned. I think we need to
abstract a single unified interface for everything that talks to HMS, and stop using the
Driver to execute DML. To support multiple Hive versions, we can add a shim layer for the
different versions of HMS.
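
As a rough sketch of what a unified interface plus a shim layer could look like (the actual design is the one described in RFC-31, not this): MetastoreSync, InMemoryMetastoreSync and ShimLoader below are hypothetical names, and the in-memory class merely stands in for a version-specific HMS client so that the example runs without Hive on the classpath.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical unified interface: every metastore operation goes through it,
// so callers never build SQL or pick a Hive version themselves.
interface MetastoreSync {
    void createDatabase(String db);
    void createTable(String db, String table);
    boolean tableExists(String db, String table);
    void addPartition(String db, String table, String partitionPath);
    Set<String> listPartitions(String db, String table);
}

// Stand-in for a version-specific shim. A real shim would wrap the HMS client
// API of one Hive version; this one keeps state in memory (assumption for the demo).
final class InMemoryMetastoreSync implements MetastoreSync {
    private final String label;
    private final Set<String> databases = new LinkedHashSet<>();
    private final Map<String, Set<String>> partitionsByTable = new HashMap<>();

    InMemoryMetastoreSync(String label) { this.label = label; }

    String label() { return label; }

    @Override public void createDatabase(String db) { databases.add(db); }

    @Override public void createTable(String db, String table) {
        if (!databases.contains(db)) {
            throw new IllegalStateException("unknown database: " + db);
        }
        partitionsByTable.putIfAbsent(db + "." + table, new LinkedHashSet<>());
    }

    @Override public boolean tableExists(String db, String table) {
        return partitionsByTable.containsKey(db + "." + table);
    }

    @Override public void addPartition(String db, String table, String partitionPath) {
        Set<String> partitions = partitionsByTable.get(db + "." + table);
        if (partitions == null) {
            throw new IllegalStateException("unknown table: " + db + "." + table);
        }
        partitions.add(partitionPath);
    }

    @Override public Set<String> listPartitions(String db, String table) {
        Set<String> partitions = partitionsByTable.get(db + "." + table);
        return partitions == null ? new LinkedHashSet<>() : new LinkedHashSet<>(partitions);
    }
}

// Hypothetical shim loader: selects an implementation by Hive major version.
final class ShimLoader {
    private static final Map<String, Supplier<MetastoreSync>> SHIMS = new HashMap<>();
    static {
        SHIMS.put("2", () -> new InMemoryMetastoreSync("hive-2.x"));
        SHIMS.put("3", () -> new InMemoryMetastoreSync("hive-3.x"));
    }

    static MetastoreSync forVersion(String hiveVersion) {
        String major = hiveVersion.split("\\.")[0];
        Supplier<MetastoreSync> shim = SHIMS.get(major);
        if (shim == null) {
            throw new IllegalArgumentException("unsupported Hive version: " + hiveVersion);
        }
        return shim.get();
    }
}
```

With something of this shape, a caller would obtain a client through ShimLoader.forVersion(...) and use only the interface methods, so supporting another Hive version means registering one more shim rather than touching the call sites.
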


I have written up a preliminary design in RFC-31 (https://cwiki.apache.org/confluence/display/HUDI/RFC+-+31%3A+Hive+integration+Improvment).
I hope everyone can help review it and offer some suggestions. Thank you very much.


Looking forward to your reply.


minglei



