hive-issues mailing list archives

From "Sushanth Sowmyan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-16266) Enable function metadata to be written during bootstrap
Date Sat, 18 Aug 2018 17:23:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-16266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16584850#comment-16584850
] 

Sushanth Sowmyan commented on HIVE-16266:
-----------------------------------------

Hi [~akolb], apologies if this reply is no longer accurate ([~anishek] or [~sankarh] might
be able to clarify if things have changed - I have not been active with Hive for a year now),
but at the time the repl subsystem was written, that's correct, and intentional.

The basic idea is this - Hive has two types of tables: MANAGED, where Hive is responsible
for the storage, and EXTERNAL, where some other external program is responsible for the storage.
A key way to think about this distinction is what happens when you do a DROP TABLE. For MANAGED
tables, if a DROP TABLE is issued, Hive should delete the data on HDFS, since we own and manage
the data as well. For EXTERNAL tables, we are guests, and some other tool is managing the
data, and thus, we should not touch it - we can drop the metadata, but we leave the data on
HDFS alone.
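
As a rough illustration of that DROP TABLE distinction, here is a toy Python model. The ToyWarehouse class, its methods, and the paths are all hypothetical stand-ins for the metastore and HDFS, not Hive APIs; it only sketches the ownership rule described above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyWarehouse:
    """Hypothetical stand-in for a metastore plus HDFS (not Hive's real API)."""
    tables: dict = field(default_factory=dict)  # table name -> (table type, location)
    files: dict = field(default_factory=dict)   # location -> list of data file names

    def create_table(self, name, table_type, location, data_files):
        self.tables[name] = (table_type, location)
        self.files[location] = list(data_files)

    def drop_table(self, name):
        # The metadata entry is removed for both table types.
        table_type, location = self.tables.pop(name)
        if table_type == "MANAGED":
            # Hive owns the storage, so the data is deleted as well.
            self.files.pop(location, None)
        # EXTERNAL: some other tool owns the data, so it is left untouched.


wh = ToyWarehouse()
wh.create_table("m", "MANAGED", "/warehouse/m", ["part-0"])
wh.create_table("e", "EXTERNAL", "/data/e", ["part-0"])
wh.drop_table("m")
wh.drop_table("e")
print(sorted(wh.files))  # ['/data/e']
```

Both tables lose their metadata, but only the MANAGED table's files are removed.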

Now, in the case where we're replicating from a primary to a secondary, if the table is an
EXTERNAL table on the primary, then an external tool is managing it on the primary. But what
about the secondary? Since the secondary is being "managed" by Hive Replication, and thus,
Hive, we own and manage it, keeping it in sync with the primary. Thus, by definition, the
copy is MANAGED even if the source is EXTERNAL. If we kept it EXTERNAL, we would start having
some weird midway behaviour that we'd have to add complex rules for - consider the same deletion
scenario:

If we have a DROP PARTITION on the source table, by definition, on the source, we do not delete
the data on source HDFS. The user will likely do an hdfs rm, refresh the data and might do
an ADD PARTITION of new data. Now, what about the destination? Should we delete the data corresponding
to that DROP PARTITION on the destination? If so, then it is consistent with behaviour for MANAGED,
rather than EXTERNAL, and thus, we should keep it as MANAGED. If not, then well, we have leftover
data sitting in HDFS in the same location, and if new data gets added in, as a result of an
upcoming ADD PARTITION, then the behaviour is indeterminate and depends on what the user did - it can
be the correct new data, it can be a partial merge or a weird append. That gets messy fast.
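
To make the leftover-data problem concrete, here is a minimal Python sketch of replaying those two events on a destination. The function name, the file names, and the assumption that new data is simply copied into the same directory are all hypothetical; this is not how the repl code is written, only a model of the scenario above.

```python
def replay_on_destination(dest_is_managed):
    """Replay DROP PARTITION, then ADD PARTITION, on a toy destination.

    Returns the files sitting in the partition directory afterwards.
    """
    # Data copied to the destination by an earlier replication run.
    partition_dir = ["old-0", "old-1"]

    # DROP PARTITION event arrives from the source.
    if dest_is_managed:
        # MANAGED semantics: the data is removed along with the metadata.
        partition_dir.clear()
    # EXTERNAL semantics: only metadata is dropped, the files stay in place.

    # ADD PARTITION event arrives; new data lands in the same location.
    partition_dir.extend(["new-0"])
    return partition_dir


print(replay_on_destination(True))   # ['new-0']
print(replay_on_destination(False))  # ['old-0', 'old-1', 'new-0']
```

With MANAGED semantics the destination ends up with only the new data; with EXTERNAL semantics the old files are still there next to it, which is the "weird append" case.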

So, for this problem and other possible unexpected problems, we decided to be consistent with
the meaning of MANAGED and EXTERNAL, and always make repl destinations MANAGED. :) 

 

> Enable function metadata to be written during bootstrap
> -------------------------------------------------------
>
>                 Key: HIVE-16266
>                 URL: https://issues.apache.org/jira/browse/HIVE-16266
>             Project: Hive
>          Issue Type: Sub-task
>          Components: repl
>    Affects Versions: 2.2.0
>            Reporter: anishek
>            Assignee: anishek
>            Priority: Major
>             Fix For: 3.0.0
>
>         Attachments: HIVE-16266.1.patch, HIVE-16266.2.patch, HIVE-16266.3.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
