hive-issues mailing list archives

From "anishek (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-20911) External Table Replication for Hive
Date Thu, 15 Nov 2018 06:48:00 GMT

     [ https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

anishek updated HIVE-20911:
---------------------------
    Description: 
External tables are currently not replicated as part of Hive replication. This JIRA aims to
enable that.

Approach:
* The target cluster will have a top-level base directory config under which all data relevant
to external tables will be copied. This will be provided via the *with* clause in the *repl
load* command. This base path will be prefixed to the path of the same external table on the
source cluster.
* Changes to an external table's directories can happen without Hive knowing about them, so we
can't capture the relevant events whenever data is added or removed. We will therefore have to
copy the data from the source path to the target path for external tables every time we run
incremental replication.
** This will require incremental *repl dump* to now create an additional file *\_external\_tables\_info*
with data in the following form:
{code}
tableName,base64Encoded(tableDataLocation)
{code}
** *repl load* will read *\_external\_tables\_info* to identify which locations are to be
copied from source to target and create corresponding tasks for them.
* New external tables will be created with metadata only; no data will be copied as part of the
regular tasks during incremental load/bootstrap load.
* Bootstrap dump will also create *\_external\_tables\_info*, which will be used to copy data
from source to target as part of bootstrap load.
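
The record format and the base-path prefixing described above can be sketched as follows. This is only an illustration in Python, not the Hive implementation: the function names are hypothetical, and it assumes standard (not URL-safe) base64, UTF-8 locations, one record per line, and that the prefix is applied to the path component of the source location with the scheme and authority dropped. The description specifies none of these details.

```python
import base64
import posixpath
from typing import Tuple
from urllib.parse import urlparse

# File name taken from the description; everything else here is illustrative.
INFO_FILE_NAME = "_external_tables_info"


def encode_entry(table_name: str, data_location: str) -> str:
    """Build one record: tableName,base64Encoded(tableDataLocation)."""
    encoded = base64.b64encode(data_location.encode("utf-8")).decode("ascii")
    return f"{table_name},{encoded}"


def decode_entry(line: str) -> Tuple[str, str]:
    """Parse one record back into (tableName, tableDataLocation).

    The standard base64 alphabet contains no comma, so splitting at the
    last comma is safe whatever characters the location itself contains.
    """
    table_name, encoded = line.rstrip("\n").rsplit(",", 1)
    return table_name, base64.b64decode(encoded).decode("utf-8")


def target_location(base_dir: str, source_location: str) -> str:
    """Prefix the target base directory to the source table's path.

    Assumption: only the path component of the source location is kept.
    """
    source_path = urlparse(source_location).path
    return posixpath.join(base_dir.rstrip("/"), source_path.lstrip("/"))


if __name__ == "__main__":
    # Hypothetical cluster addresses, used only for the demonstration.
    record = encode_entry("sales", "hdfs://source-nn:8020/warehouse/ext/sales")
    name, location = decode_entry(record)
    print(name, target_location("hdfs://target-nn:8020/replica/external", location))
```

Encoding the location keeps commas or newlines in a path from corrupting the comma-separated record, which is presumably why the description calls for base64.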

  was:
External tables are currently not replicated as part of Hive replication. This JIRA aims to
enable that.

Approach:
* The target cluster will have a top-level base directory config under which all data relevant
to external tables will be copied. This will be provided via the *with* clause in the *repl
load* command. This base path will be prefixed to the path of the same external table on the
source cluster.
* Changes to an external table's directories can happen without Hive knowing about them, so we
can't capture the relevant events whenever data is added or removed. We will therefore have to
copy the data from the source path to the target path for external tables every time we run
incremental replication.
** This will require incremental *repl dump* to now create an additional file *\_external\_tables\_info*
with data in the following form:
{code}
OperationType,tableName,base64Encoded(tableDataLocation)
{code}
where OperationType is one of (ADD, REMOVE).
** *repl load* will look up all the external tables on the target and remove the tables listed with
the REMOVE type in the above file.
** For the remaining tables it will create tasks that copy the corresponding paths from source to
target, along with the existing tasks for incremental load.
* New external tables will be created with data copied as part of the regular tasks during incremental
load, applying the base directory prefix.
* Bootstrap will also create / copy these external tables as part of its regular workflow,
applying the base directory prefix.


> External Table Replication for Hive
> -----------------------------------
>
>                 Key: HIVE-20911
>                 URL: https://issues.apache.org/jira/browse/HIVE-20911
>             Project: Hive
>          Issue Type: Bug
>          Components: HiveServer2
>    Affects Versions: 4.0.0
>            Reporter: anishek
>            Assignee: anishek
>            Priority: Critical
>             Fix For: 4.0.0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
