atlas-dev mailing list archives

From "Suma Shivaprasad (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (ATLAS-568) Parallelize Hive hook operations
Date Tue, 22 Mar 2016 17:59:25 GMT

     [ https://issues.apache.org/jira/browse/ATLAS-568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suma Shivaprasad updated ATLAS-568:
-----------------------------------
    Description: 
Maintaining the same order of operations that were executed in Hive is also crucial in ATLAS.
If the operations are not applied in order, correctness issues can easily arise in the ATLAS
repository. For example, a table's columns being dropped and then the table being renamed,
tables or databases being dropped, etc. all need to be executed in the same order as they were
in the Hive metastore. There are multiple issues that need to be addressed here:

1. How do we ensure the order of messages on the producer/hook side?
2. Once the producer/hook publishes these messages onto KAFKA, how do we ensure that the order
of processing is the same as the order of publishing? (A producer-side sketch follows this list.)
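Not part of the original issue text, but to make (1) and (2) concrete, here is a minimal,
hypothetical producer-side sketch (the class name, topic handling and the qualifiedName key are
assumptions, not the actual Atlas hook code): Kafka only guarantees ordering within a single
partition, so keying every notification by the entity it mutates keeps all operations on that
table/database in publish order, and capping in-flight requests avoids reordering on retries.

    // Hypothetical producer-side sketch, NOT the actual Atlas Hive hook:
    // key each notification by the fully-qualified entity name so Kafka
    // keeps all messages for that entity on one partition, in publish order.
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class OrderedHookNotifier {
        private final Producer<String, String> producer;
        private final String topic;

        public OrderedHookNotifier(String bootstrapServers, String topic) {
            Properties props = new Properties();
            props.put("bootstrap.servers", bootstrapServers);
            props.put("acks", "all");                                  // don't drop messages
            props.put("retries", "3");
            props.put("max.in.flight.requests.per.connection", "1");  // avoid reordering on retry
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            this.producer = new KafkaProducer<>(props);
            this.topic = topic;
        }

        // qualifiedName (e.g. "default.customers") picks the partition; Kafka
        // preserves order only within a partition, so operations on the same
        // table/database are consumed in the order they were published.
        public void notify(String qualifiedName, String operationJson) {
            producer.send(new ProducerRecord<>(topic, qualifiedName, operationJson));
        }

        public void close() {
            producer.close();
        }
    }

This only orders messages per entity, though; operations that span entities (e.g. dropping a
database after its tables) can still interleave across partitions, which is where the
timestamp/window idea below comes in.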

One suggested approach is to assign a timestamp to all messages on the producer side and to
window/batch these messages on the consumer/ATLAS server side.
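A rough sketch of what that could look like on the consumer side (again hypothetical; the class
and field names are not Atlas APIs, and each message is assumed to carry the producer-assigned
timestamp): buffer incoming messages for a configured window, sort the buffered batch by that
timestamp, then apply the operations in order.

    // Hypothetical consumer-side windowing sketch, NOT Atlas server code:
    // buffer messages for a configured window, order them by the timestamp
    // the producer assigned, then apply them in that order.
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.function.Consumer;

    public class TimestampWindowProcessor {
        /** A notification plus the timestamp the hook assigned when publishing. */
        public static class TimestampedMessage {
            final long producerTimestampMs;
            final String payload;
            public TimestampedMessage(long producerTimestampMs, String payload) {
                this.producerTimestampMs = producerTimestampMs;
                this.payload = payload;
            }
        }

        private final long windowMs;                    // configured window length
        private final Consumer<String> applyOperation;  // writes one operation to the repository
        private final List<TimestampedMessage> buffer = new ArrayList<>();
        private long windowStartMs = System.currentTimeMillis();

        public TimestampWindowProcessor(long windowMs, Consumer<String> applyOperation) {
            this.windowMs = windowMs;
            this.applyOperation = applyOperation;
        }

        /** Called for every message taken off the Kafka consumer. */
        public void onMessage(TimestampedMessage msg) {
            buffer.add(msg);
            if (System.currentTimeMillis() - windowStartMs >= windowMs) {
                flush();
            }
        }

        /** Sort the buffered window by producer timestamp and apply in that order. */
        public void flush() {
            buffer.sort(Comparator.comparingLong((TimestampedMessage m) -> m.producerTimestampMs));
            for (TimestampedMessage m : buffer) {
                applyOperation.accept(m.payload);
            }
            buffer.clear();
            windowStartMs = System.currentTimeMillis();
        }
    }

The obvious trade-off is latency (nothing in a window is applied until the window closes), and a
message whose timestamp falls just before a window boundary can still arrive after that window
has flushed, so this would likely need to be combined with per-entity ordering on the producer
side.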



  was:
Maintaining the same order of operations that were executed in Hive is also crucial in ATLAS.
If the operations are not applied in order, correctness issues can easily arise in the ATLAS
repository. For example, a table's columns being dropped and then the table being renamed,
tables or databases being dropped, etc. all need to be executed in the same order as they were
in the Hive metastore. There are multiple issues that need to be addressed here:

1. How do we ensure the order of messages on the producer/hook side?
2. Once the producer/hook publishes these messages onto KAFKA, how do we ensure that the order
of processing is the same as the order of publishing?

One suggested approach is to assign a timestamp to all messages on the producer side and to
window/batch these messages on the consumer/ATLAS server side. Order these messages according
to the timestamp within the window, which is a configured time period, and then execute the
operations in that order.




>  Parallelize Hive hook operations
> ---------------------------------
>
>                 Key: ATLAS-568
>                 URL: https://issues.apache.org/jira/browse/ATLAS-568
>             Project: Atlas
>          Issue Type: Sub-task
>    Affects Versions: 0.7-incubating
>            Reporter: Suma Shivaprasad
>             Fix For: 0.7-incubating
>
>
> Maintaining the same order of operations that were executed in Hive is also crucial in ATLAS.
If the operations are not applied in order, correctness issues can easily arise in the ATLAS
repository. For example, a table's columns being dropped and then the table being renamed,
tables or databases being dropped, etc. all need to be executed in the same order as they were
in the Hive metastore. There are multiple issues that need to be addressed here:
> 1. How do we ensure the order of messages on the producer/hook side?
> 2. Once the producer/hook publishes these messages onto KAFKA, how do we ensure that the
order of processing is the same as the order of publishing?
> One suggested approach is to assign a timestamp to all messages on the producer side and to
window/batch these messages on the consumer/ATLAS server side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
