hive-issues mailing list archives

From "Sankar Hariappan (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-16676) Bootstrap REPL DUMP should ensure no data loss due to concurrent RENAME operations.
Date Tue, 13 Jun 2017 04:55:00 GMT

     [ https://issues.apache.org/jira/browse/HIVE-16676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sankar Hariappan updated HIVE-16676:
------------------------------------
    Description: 
 For bootstrap dump, if a table is renamed after the table names have been fetched, the new
table will be missing from the dump, so the target database has neither the old nor the new table.
During incremental replication, later RENAME events will be no-ops because the old table doesn't
exist in the target. This leads to data loss, as the table never gets replicated to the target.
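
The race described above can be reproduced with a small in-memory model. This is only an illustrative sketch: the dictionaries standing in for the metastore and the target, and the helper names bootstrap_dump, rename and load_rename_event, are hypothetical and are not Hive APIs.

```python
# Hypothetical in-memory model of the race; none of these names are Hive APIs.

def bootstrap_dump(metastore, concurrent_ddl=None):
    """Fetch table names first, then dump each table by name."""
    names = list(metastore)              # step 1: fetch the table names
    if concurrent_ddl is not None:
        concurrent_ddl(metastore)        # a concurrent DDL runs between the two steps
    # step 2: only tables still present under a fetched name get dumped
    return {name: metastore[name] for name in names if name in metastore}


def rename(metastore, old, new):
    """Rename a table on the source."""
    metastore[new] = metastore.pop(old)


def load_rename_event(target, old, new):
    """Incremental load of a RENAME event: a no-op if the old table is missing."""
    if old in target:
        target[new] = target.pop(old)


source = {"A": ["row1", "row2"]}
# A is renamed to B after the names were fetched, but before A is dumped.
target = bootstrap_dump(source, lambda m: rename(m, "A", "B"))
# The later incremental RENAME(A->B) finds no A in the target and does nothing.
load_rename_event(target, "A", "B")

print(source)  # {'B': ['row1', 'row2']}
print(target)  # {}
```

The target ends up with neither A nor B, which is the data loss this issue addresses.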

To generalise the solution for this issue, the following logic is proposed.
1. In the bootstrap dump logic, after successfully dumping all tables to dumpDir, traverse
the events and, for each rename event encountered, do the following.
    - If the target table/partition exists in dumpDir, do nothing. The event load in the target
should be idempotent and should apply an event only if the event is newer than the object.
    - If only the source table/partition exists in dumpDir, check whether it is newer than the
event. If yes, dump the target table/partition if it exists in the metastore. This handles the
scenario RENAME(A->B)->Create(A)->Dump(A), in which both A and B should be
in the dump. In the case of RENAME(A->B)->RENAME(B->A)->Dump(A), the target
table B will be missing from the metastore and hence there is nothing to dump.
    - If both source and target tables/partitions are missing from dumpDir, dump the target
table/partition alone if it exists in the metastore. If the target table/partition is missing
from the metastore, some later event must have dropped or renamed it, and hence there is
nothing to dump.
    - If both source and target tables/partitions exist in dumpDir, they are already independent
objects, neither being a renamed version of the other, and hence there is nothing to dump.
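
The four cases above can be sketched as a single decision function. This is a minimal sketch under stated assumptions, not the actual Hive implementation: RenameEvent, extra_tables_for_rename, and the representation of dumpDir as a mapping from object name to the event ID at which the dumped object was created are all hypothetical. The description does not say what happens when the dumped source is older than the event; the sketch assumes nothing extra is dumped in that case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RenameEvent:
    """Hypothetical stand-in for a rename notification event."""
    event_id: int
    source: str
    target: str


def extra_tables_for_rename(event, dump_dir, metastore):
    """Return the names that must additionally be dumped for one rename event.

    dump_dir:  dict name -> event ID at which the dumped object was created
               (assumed representation of "newer than the event").
    metastore: set of names that currently exist on the source.
    """
    if event.target in dump_dir:
        # Target already dumped (alone or together with the source): do nothing.
        return []
    if event.source in dump_dir:
        # Only the source was dumped: dump the target if the dumped source is
        # newer than the event and the target still exists in the metastore.
        newer = dump_dir[event.source] > event.event_id
        if newer and event.target in metastore:
            return [event.target]
        return []  # assumption: an older source needs no extra dump
    # Neither was dumped: dump the target alone if it still exists.
    return [event.target] if event.target in metastore else []


# RENAME(A->B) at event 10, Create(A) at event 11, then Dump(A).
print(extra_tables_for_rename(RenameEvent(10, "A", "B"), {"A": 11}, {"A", "B"}))
# RENAME(A->B) at 10, RENAME(B->A) at 11, then Dump(A): B no longer exists.
print(extra_tables_for_rename(RenameEvent(10, "A", "B"), {"A": 11}, {"A"}))
```

The first call reports that B must be dumped as well; the second reports nothing to dump.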
We still use bootstrapBeginReplID as the lastReplID of the bootstrap dump. This ensures that REPL
STATUS returns it (same as the current code) and that the next incremental dump won’t miss any events.

We won’t combine bootstrap + incremental in a single dump. This is complex, as we won’t
be able to predict the previous state of any object while checking the applicability of an
event. We just need to ensure that, in case of a rename, either the source or the target table
exists in the dump to avoid loss of data.
It is expected that the user will trigger an incremental dump/load immediately after bootstrap to
ensure a consistent state of the database.
Incremental REPL LOAD:
Modify the event load behaviour in the target as follows.
    - If the object exists, apply the event only if the event is newer than the object, for any
type of event. In the current code, we do a blind replace for some event types.
    - If the object is missing, just ignore the event. There is no need to create an object which
will ultimately be dropped or renamed by a future applied event. In the current code, we create
the object with the metadata available for CREATE and ALTER events, whereas TRUNCATE and RENAME fail.
    - An object will be missing in 2 cases: when it has been renamed or dropped.
    - If replace/create were allowed, then events like alter and rename would lead to an
inconsistent state.
    - In case of CREATE_TABLE, we’ll refer to the database lastReplId to decide whether the table
needs to be created.
    - In case of ADD_PARTITION, we’ll refer to the table object’s lastReplId to decide whether the
partition needs to be added.
    - ALTER, DROP, RENAME, TRUNCATE and INSERT events should be no-ops if the corresponding object
is missing.
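
The load rules above can be sketched as a single applicability check. This is a sketch under assumptions, not Hive code: the function name should_apply, the event type strings, and the use of None for a missing object are hypothetical, and the strict "greater than" comparison is one reading of "newer".

```python
def should_apply(event_type, event_id, obj_repl_id, parent_repl_id=None):
    """Decide whether an event should be applied on the target.

    obj_repl_id:    lastReplId of the object the event refers to, or None if
                    the object is missing on the target.
    parent_repl_id: lastReplId of the database (for CREATE_TABLE) or of the
                    table (for ADD_PARTITION), or None if unknown.
    """
    if obj_repl_id is not None:
        # Object exists: apply only if the event is newer than the object.
        return event_id > obj_repl_id
    if event_type in ("CREATE_TABLE", "ADD_PARTITION"):
        # Object missing: creation is decided against the parent's lastReplId.
        return parent_repl_id is not None and event_id > parent_repl_id
    # ALTER, DROP, RENAME, TRUNCATE and INSERT on a missing object are no-ops.
    return False


print(should_apply("ALTER_TABLE", 20, None))        # missing object
print(should_apply("ALTER_TABLE", 20, 15))          # event newer than object
print(should_apply("CREATE_TABLE", 12, None, 15))   # database already past event 12
```

Keeping the check in one place means every event type goes through the same "newer than the object" comparison, which is what makes repeated loads idempotent.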
The above logic works for RENAME as well.
    - If the rename event finds that the source table exists in the metastore, check whether the
event is newer than the object and, if yes, apply the rename. In this case, it is always
guaranteed that the target table doesn't exist.
    - If the source table is missing, just skip the event, as the rename would’ve already been
applied. It is not necessary to check the target table, as it may or may not exist. It will not
exist if it has been dropped or renamed. However, it is safe to assume that the current rename
was already applied and hence the source table is missing.
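
The rename handling on load can be sketched as follows. Again this is only an illustration: load_rename and the dictionary mapping table name to lastReplId are hypothetical, and stamping the renamed table with the event ID is an assumption made so that a replayed event is recognised as stale.

```python
def load_rename(target, event_id, source, dest):
    """Apply a RENAME event on the target; return True if it was applied.

    target: dict table name -> lastReplId of that table (assumed model).
    """
    if source not in target:
        # Source missing: assume the rename was already applied, skip.
        return False
    if event_id <= target[source]:
        # Event is not newer than the object: skip.
        return False
    # Per the description, the target name is guaranteed not to exist here.
    target.pop(source)
    target[dest] = event_id  # assumption: the object now carries this event ID
    return True


tables = {"A": 5}
print(load_rename(tables, 10, "A", "B"))  # first application
print(load_rename(tables, 10, "A", "B"))  # replay of the same event
print(tables)
```

Replaying the same event leaves the target unchanged, which is the idempotent behaviour the description asks for.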


  was:
 For bootstrap dump, if the table is renamed after fetching the table names, then new table
will be missing in the dump and so the target database doesn't have both old and new table.
During incremental replication, later RENAME events will be noop as the old table doesn't
exist in target. This leads to data loss as the table never gets replicated in the target.

To generalise the solution for this issue, the following logic is proposed.
1. Each table should store its CREATE event ID in the table parameters. If a table follows
a Create -> Drop -> Create sequence, then it is easy to tell whether the table is
the old or the new one.
2. Bootstrap should combine the delta changes as Incremental Dump into the dumpDir.
3. After bootstrap dump completes, then traverse the events from bootDumpBeginReplId.
    - If a RENAME event is found, then check,
    - If the source table is dumped and create event ID matches, then just dump the RENAME
event as such.
    - If the source table is dumped but the create event ID is later than the event, then
skip the event.
    - If the source table doesn’t exist, but the target table exists, then skip the event.
    - If both source and target tables are missing, then dump the target table to the bootstrap
dumpDir.

4. For other events, just dump the event with the following logic.
    - CREATE: If the object exists, then skip else dump it.
    - DROP: If the object doesn’t exist, then skip else dump it.
    - ALTER: If the object exists and the create event ID matches, then dump else skip it.

5. Rename event load should check,
    - If source table exists and if create event ID is same, then apply the event else skip
it.
    - If source table doesn’t exist, then check if the target table exists, if yes, then
skip the event.



> Bootstrap REPL DUMP should ensure no data loss due to concurrent RENAME operations.
> -----------------------------------------------------------------------------------
>
>                 Key: HIVE-16676
>                 URL: https://issues.apache.org/jira/browse/HIVE-16676
>             Project: Hive
>          Issue Type: Sub-task
>          Components: repl
>    Affects Versions: 2.1.0
>            Reporter: Sankar Hariappan
>            Assignee: Sankar Hariappan


