hive-issues mailing list archives

From "Sankar Hariappan (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-21761) Support table level replication in Hive
Date Mon, 10 Jun 2019 14:24:00 GMT

     [ https://issues.apache.org/jira/browse/HIVE-21761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sankar Hariappan updated HIVE-21761:
------------------------------------
    Description: 
*Requirements:*
{code}
- Users need to define a replication policy that replicates specific tables. This lets them
replicate only the business-critical tables instead of all tables, which may
strain network bandwidth and storage and slow down Hive replication.
- Users need to define a replication policy using regular expressions (such as db.sales_*),
include additional tables that do not match the given pattern, and exclude some tables
that do match it.
- Users need to dynamically add tables to, or remove tables from, the list by manually changing the
replication policy at run time.
{code}
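
The include/exclude requirement can be illustrated with a minimal Python sketch. It is not Hive code: the function name `select_tables` and the sample table names are hypothetical, and it assumes each pattern is a regular expression that must match the whole table name, so the db.sales_* shorthand above is written as `sales_.*` here.

```python
import re

def select_tables(tables, include_patterns, exclude_patterns=()):
    """Return tables that match any include pattern and no exclude pattern."""
    def matches_any(name, patterns):
        # fullmatch: the pattern has to cover the entire table name
        return any(re.fullmatch(p, name) for p in patterns)

    return [t for t in tables
            if matches_any(t, include_patterns)
            and not matches_any(t, exclude_patterns)]

# Hypothetical table names, for illustration only.
tables = ["sales_2018", "sales_2019", "sales_tmp", "customers", "audit_log"]

# Pattern plus one extra non-matching table, minus one matching table.
selected = select_tables(tables, ["sales_.*", "customers"], ["sales_tmp"])
print(selected)
```

Exclusion takes precedence over inclusion in this sketch, which mirrors the requirement that a table can be excluded even though it matches the pattern.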

*Design:*
{code}
1. Hive continues to support the DB-level replication policy of the format <db_name>.* but logically,
we support the policy as <db_name>.(t1, t3, …).

2. Regular expressions can also be used as the replication policy. For example:
  a. <db_name>.[<prefix*>]
  b. <db_name>.[<*suffix>]
  c. <db_name>.[<prefix*suffix>]

3. If a regular expression is provided as the replication policy, Hive also accepts include
and exclude lists as input, which also make it possible to dynamically add/remove tables for replication.
  a. The exclude list specifies the tables to be excluded even if they satisfy the regular expression.
  b. The include list specifies the tables to be included in addition to the tables satisfying
the regular expression.

4. The new format for the replication policy has 3 parts, all separated by a dot (.).
  a. The first part is the DB name.
  b. The second part is the included list: comma-separated table names/regexes within square brackets [].
If the square brackets are absent, the policy is treated as single table replication, which skips
DB-level events.
  c. The third part is the excluded list: comma-separated table names/regexes within square brackets [].
  Examples:
    - <db_name> -- Full DB replication, which is currently supported.
    - <db_name>.['.*?'] -- Full DB replication.
    - <db_name>.[] -- Replicate just functions and do not include any tables.
    - <db_name>.['t1', 't3'] -- DB replication with a static list of tables, t1 and t3,
included.
    - <db_name>.['t1*', 't2'].['t100'] -- DB replication with all tables having the prefix
t1, plus table t2, which doesn’t have the prefix t1, and excluding t100, which has the
prefix t1.

5. If the DB property “repl.source.for” is set, then by default all the tables in the
DB are enabled for replication and continue to archive deleted data to the CM path.

6. REPL DUMP takes 2 inputs along with the existing FROM and WITH clauses.
  a. REPL DUMP <current_repl_policy> [REPLACE <previous_repl_policy>] FROM <last_repl_id>
WITH <key_values_list>;
current_repl_policy and previous_repl_policy can be in any format mentioned in Point 4.
  b. The REPLACE clause is to be supported to take the previous replication policy as input.
  c. The rest of the format remains the same.

7. Now, REPL DUMP on this DB will replicate the tables based on current_repl_policy.

8. Single table replication of the format <db_name>.t1 doesn’t allow changing the policy
dynamically. So the REPLACE clause is not allowed if previous_repl_policy is of this format.

9. Any table that is added dynamically, either through a change in the regular expression or by
being added to the include list, should be bootstrapped.
  a. Hive will automatically figure out the list of newly included tables by comparing
the current_repl_policy and previous_repl_policy inputs, and will combine the bootstrap dump for
the added tables with the incremental dump. As the first incremental dump can be combined with the
bootstrap dump, this removes the current limitation of the target DB being inconsistent after
bootstrap until the first incremental replication is run.
  b. If any table is renamed, it may get dynamically added to or removed from replication based
on the defined replication policy and include/exclude lists. So, Hive will perform a bootstrap for
a table that is newly included after a rename.
  c. Also, if the renamed table is excluded from the replication policy, then the old
table needs to be dropped at the target as well.

10. Only the initial bootstrap load expects the target DB to be empty. The intermediate
bootstrap of tables due to a regex or inclusion/exclusion list change, or due to renames, doesn’t expect
the target DB or table to be empty. If a table with the same name exists during such a bootstrap,
the table will be overwritten, including its data.
{code}
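
The three-part policy format from Point 4 can be sketched as a small Python parser. This is an illustration under assumptions, not Hive's actual parser: `ReplPolicy`, `parse_policy` and the database name `sales` are hypothetical, DB and table names are assumed to consist of word characters only, and list entries are assumed to be single-quoted as in the examples above. The parser only splits the policy into its parts and does not interpret the patterns.

```python
import re
from typing import List, NamedTuple, Optional

class ReplPolicy(NamedTuple):
    db: str
    include: Optional[List[str]]   # None: no include list was given
    exclude: List[str]
    single_table: Optional[str]    # set only for the <db_name>.t1 form

# <db>, <db>.[...], <db>.[...].[...] or <db>.<table>
_POLICY = re.compile(
    r"^(?P<db>\w+)"
    r"(?:\.(?P<inc>\[[^\]]*\])(?:\.(?P<exc>\[[^\]]*\]))?"
    r"|\.(?P<table>\w+))?$"
)

def _items(bracketed):
    """Extract the single-quoted entries from a '[...]' list."""
    return re.findall(r"'([^']*)'", bracketed) if bracketed else []

def parse_policy(text):
    m = _POLICY.match(text.strip())
    if m is None:
        raise ValueError(f"not a valid replication policy: {text!r}")
    include = _items(m["inc"]) if m["inc"] is not None else None
    return ReplPolicy(m["db"], include, _items(m["exc"]), m["table"])

policy = parse_policy("sales.['t1*', 't2'].['t100']")
print(policy)
```

Because patterns such as '.*?' contain dots themselves, the sketch matches the bracketed lists as units instead of splitting the whole string on the dot separator.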

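
Points 9a to 9c describe two decisions: which tables need a bootstrap when the policy changes, and what to do when a table is renamed. The Python sketch below shows one way to express them. All names are hypothetical, a policy is simplified to an (include, exclude) pair of pattern lists, and patterns are assumed to be regular expressions that must match the whole table name (so the prefix pattern is written `t1.*`). It illustrates the rules as described, not Hive's implementation.

```python
import re

def in_policy(table, include, exclude):
    """True if the table matches an include pattern and no exclude pattern."""
    def hit(patterns):
        return any(re.fullmatch(p, table) for p in patterns)
    return hit(include) and not hit(exclude)

def tables_to_bootstrap(tables, previous, current):
    """Point 9a: tables covered by the current policy but not the previous one."""
    return [t for t in tables
            if in_policy(t, *current) and not in_policy(t, *previous)]

def rename_action(old, new, policy):
    """Points 9b and 9c: decide how a rename affects replication."""
    was_in, now_in = in_policy(old, *policy), in_policy(new, *policy)
    if was_in and now_in:
        return "rename"              # both names replicated: plain rename
    if now_in:
        return "bootstrap " + new    # newly included after the rename
    if was_in:
        return "drop " + old         # renamed out of the policy
    return "ignore"                  # never part of replication

previous = (["t1.*"], [])
current = (["t1.*", "t2"], ["t100"])
tables = ["t1", "t10", "t100", "t2", "t3"]

added = tables_to_bootstrap(tables, previous, current)
print(added)
print(rename_action("t3", "t15", current))
print(rename_action("t10", "t100", current))
```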

  was:
*Requirements:*
{code}
- User needs to define replication policy to replicate any specific table. This enables user
to replicate only the business critical tables instead of replicating all tables which may
throttle the network bandwidth, storage and also slow-down Hive replication.
- User needs to define replication policy using regular expressions (such as db.sales_*) and
needs to include additional tables which are non-matching given pattern and exclude some tables
which are matching given pattern.
- User needs to dynamically add/remove tables to the list either by manually changing the
replication policy during run time.
{code}

*Design:*
{code}
1. Hive continue to support DB level replication policy of format <db_name>.* but logically,
we support the policy as <db_name>.(t1, t3, …).

2. Regular expression can also be supported as replication policy. For example,
  a. <db_name>.[<prefix*>], 
  b. <db_name>.[<*suffix>], 
  c. <db_name>.[<prefix*suffix>]. 

3. If regular expression is provided as replication policy, then Hive also accepts include
and exclude lists as input which also helps to dynamically add/remove tables for replication.
  a. Exclude list specifies the tables to be excluded even if it satisfies the regular expression.

  b. Include list specifies the tables to be included in addition to the tables satisfying
the regular expression. 

4. New format for the Replication policy have 3 parts all separated with Dot (.).
  a. First part is DB name.
  b. Second part is included list. Comma separated table names/regex with in square brackets[].
 If square brackets are not there, then it is treated as single table replication which skips
DB level events.
  c. Third part is excluded list. Comma separated table names/regex with in square brackets[].
    - <db_name> -- Full DB replication which is currently supported
    - <db_name>.[]  - Full DB replication
    - <db_name>.['.*?']  - Full DB replication
    - <db_name>.t1 -- Single table replication (DB events excluded) which is currently
supported
    - <db_name>.['t1', 't3']  -- DB replication with static list of tables t1 and t3
included.
    - <db_name>.['t1*', 't2'].['t100'] -- DB replication with all tables having prefix
t1 and also include table t2 which doesn’t have prefix t1 and exclude t100 which has the
prefix t1.

5. If the DB property “repl.source.for” is set, then by default all the tables in the
DB will be enabled for replication and will continue to archive deleted data to CM path.

6. REPL DUMP takes 2 inputs along with existing FROM and WITH clause.
  a. REPL DUMP <current_repl_policy> [REPLACE <previous_repl_policy> FROM <last_repl_id>
WITH <key_values_list>;
current_repl_policy and previous_repl_policy can be any format mentioned in Point-4.
  b. REPLACE clause to be supported to take previous repl policy as input. 
  c. Rest of the format remains same.

7. Now, REPL DUMP on this DB will replicate the tables based on current_repl_policy.

8. Single table replication of format <db_name>.t1 doesn’t allow changing the policy
dynamically. So REPLACE clause is not allowed if previous_repl_policy of this format.

9. If any table is added dynamically either due to change in regular expression or added to
include list should be bootstrapped. 
  a. Hive will automatically figure out the list of tables newly included in the list by comparing
the current_repl_policy & previous_repl_policy inputs and combine bootstrap dump for added
tables as part of incremental dump. As we can combine first incremental with bootstrap dump,
it removes the current limitation of target DB being inconsistent after bootstrap unless we
run first incremental replication.
  b. If any table is renamed, then it may gets dynamically added/removed for replication based
on defined replication policy + include/exclude list. So, Hive will perform bootstrap for
the table which is just included after rename. 
  c. Also, if renamed table is excluded from replication policy, then need to drop the old
table at target as well.

10. Only the initial bootstrap load expects the target DB to be empty but the intermediate
bootstrap on tables due to regex or inclusion/exclusion list change or renames doesn’t expect
the target DB or table to be empty. If any table with same name exist during such bootstrap,
the table will be overwritten including data.
{code}



> Support table level replication in Hive
> ---------------------------------------
>
>                 Key: HIVE-21761
>                 URL: https://issues.apache.org/jira/browse/HIVE-21761
>             Project: Hive
>          Issue Type: New Feature
>          Components: repl
>            Reporter: Sankar Hariappan
>            Assignee: Sankar Hariappan
>            Priority: Major
>              Labels: DR, Replication



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
