hive-user mailing list archives

From "Mich Talebzadeh"
Subject RE: Synchronizing Hive metastores across clusters
Date Thu, 17 Dec 2015 21:52:16 GMT
Hi Elliot.


Strictly speaking, I believe your question is about what happens when the metastore in the
replicate gets out of sync with the primary, so that any query against a cloud table shows,
say, the partitions as of time T0 rather than T1?


I don’t know what your metastore is on. With ours on Oracle this can happen when there is
a network glitch, and hence the metadata tables can get out of sync. Each table has a materialized
view (MV) log that keeps the deltas for that table, and the deltas are delivered to the replicate
table every, say, 30 seconds (configurable). So these are the scenarios:


1.    Network issue. The deltas cannot be delivered and the replicate table is out of sync.
The undelivered changes are kept in the primary table's MV log until the network is back and
the next scheduled refresh delivers them. There could be a backlog.

2.    The replicate table gets out of sync. In this case the Oracle procedure DBMS_MVIEW.REFRESH
is used to sync the replicate table. Again, this is best done when there is no activity in the primary.
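
As a rough illustration of scenario 2, the sketch below builds the anonymous PL/SQL block that calls DBMS_MVIEW.REFRESH for one materialized view. The view name HIVEUSER.TBLS_MV is made up for the example, and the helper build_refresh_block is hypothetical; only the procedure name and its list and method parameters come from Oracle. The resulting string would still have to be run through whatever Oracle client you use.

```python
def build_refresh_block(mview_name, method="C"):
    """Return a PL/SQL block that refreshes one materialized view.

    method: 'F' = fast refresh (apply the MV log deltas),
            'C' = complete refresh (rebuild from the primary table).
    """
    if method not in ("F", "C"):
        raise ValueError("method must be 'F' (fast) or 'C' (complete)")
    # The name is embedded in SQL text, so accept only simple SCHEMA.NAME identifiers.
    parts = mview_name.split(".")
    if not all(part.replace("_", "").isalnum() for part in parts):
        raise ValueError(f"unexpected characters in name: {mview_name!r}")
    return (
        f"BEGIN DBMS_MVIEW.REFRESH(list => '{mview_name}', "
        f"method => '{method}'); END;"
    )


# HIVEUSER.TBLS_MV is a hypothetical replicate of a metastore table.
print(build_refresh_block("HIVEUSER.TBLS_MV", "C"))
```

A complete refresh ('C') is the safer choice once the replicate has drifted, at the cost of re-reading the whole primary table; a fast refresh ('F') only applies what is in the MV log.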



We use Oracle for our metastore as the Bank has many instances of Oracle, Sybase and Microsoft
SQL Server, and it is pretty easy for DBAs to look after a small Hive schema on an Oracle instance.


I gather if we build a model based on what classic databases do to keep reporting database
tables in sync (which is in essence what we are talking about) then we should be OK.


That takes care of the metadata, but I noticed that you also mention syncing data on HDFS
in the replicate as well. It sounds like many people go for DistCp, an application shipped
with Hadoop that uses a MapReduce job to copy files in parallel. There also seems to be a
good article on general replication at Facebook.
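
For what it is worth, a DistCp run could be driven from a small script along these lines. This is only a sketch: the two NameNode URIs are invented for the example, and distcp_command is a hypothetical helper. The -update flag is a standard DistCp option that copies only files missing from or different at the target.

```python
def distcp_command(source_uri, target_uri, update=True):
    """Build the argument list for a 'hadoop distcp' invocation."""
    cmd = ["hadoop", "distcp"]
    if update:
        # Copy only files that are missing or differ at the target.
        cmd.append("-update")
    cmd += [source_uri, target_uri]
    return cmd


# Hypothetical source and target clusters.
cmd = distcp_command(
    "hdfs://onprem-nn:8020/warehouse/sales",
    "hdfs://cloud-nn:8020/warehouse/sales",
)
print(" ".join(cmd))
```

On a machine with a Hadoop client installed, the list can be passed to subprocess.run(cmd, check=True) so that a failed copy raises an error instead of passing silently.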






Mich Talebzadeh


Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly




From: Elliot West
Sent: 17 December 2015 17:17
Subject: Re: Synchronizing Hive metastores across clusters


Hi Mich,


In your scenario is there any coordination of data syncing on HDFS and metadata in HCatalog?
I.e. could a situation occur where the replicated metastore shows a partition as 'present'
yet the data that backs the partition in HDFS has not yet arrived at the replica filesystem?
I imagine one could avoid this by snapshotting the source metastore, then syncing HDFS, and
then finally shipping the snapshot to the replica(?).
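
The ordering proposed above (snapshot the metadata, sync the data, then publish the snapshot) can be sketched with plain in-memory structures. Everything here is a stand-in: the dicts play the role of the two metastores, the sets play the role of the two filesystems, and the function names are made up for the illustration.

```python
def replicate_once(source_meta, source_files, replica_meta, replica_files):
    """Run one sync cycle.

    source_meta / replica_meta: dict of partition name -> data path.
    source_files / replica_files: set of data paths present on each side.
    """
    # Step 1: freeze a consistent view of the source metadata.
    snapshot = dict(source_meta)
    # Step 2: copy the data the snapshot refers to (stand-in for the HDFS sync).
    for path in snapshot.values():
        if path in source_files:
            replica_files.add(path)
    # Step 3: publish the snapshot only after the data copy has finished.
    replica_meta.clear()
    replica_meta.update(snapshot)


def dangling_partitions(meta, files):
    """Partitions the metastore lists whose data has not arrived."""
    return sorted(p for p, path in meta.items() if path not in files)


source_meta = {"dt=2015-12-16": "/w/t/dt=2015-12-16", "dt=2015-12-17": "/w/t/dt=2015-12-17"}
source_files = set(source_meta.values())
replica_meta, replica_files = {}, set()
replicate_once(source_meta, source_files, replica_meta, replica_files)
print(dangling_partitions(replica_meta, replica_files))
```

Because the metadata is published last, a partition added at the source after the snapshot was taken simply shows up in the next cycle rather than appearing in the replica without its data.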


Thanks - Elliot.


On 17 December 2015 at 16:57, Mich Talebzadeh wrote:

Sounds like one-way replication of the metastore. Depending on your metastore platform, that could
be achieved pretty easily.


Mine is Oracle and I use Materialized View replication, which is pretty good but not the latest
technology. Other options would be GoldenGate or SAP Replication Server.






From: Mich Talebzadeh
Sent: 17 December 2015 16:47
Subject: RE: Synchronizing Hive metastores across clusters


Are both clusters in active/active mode, or is the cloud-based cluster a standby?


From: Elliot West
Sent: 17 December 2015 16:21
Subject: Synchronizing Hive metastores across clusters




I'm thinking about the steps required to repeatedly push Hive datasets out from a traditional
Hadoop cluster into a parallel cloud-based cluster. This is not a one-off; it needs to be
a constantly running sync process. As new tables and partitions are added in one cluster,
they need to be synced to the cloud cluster. Assuming for a moment that I have the HDFS data
syncing working, I'm wondering what steps I need to take to reliably ship the HCatalog metadata
across. I use HCatalog as the point of truth as to when data is available and where it
is located, and so I think that metadata is a critical element to replicate in the cloud-based
cluster.


Does anyone have any recommendations on how to achieve this in practice? One issue (of many,
I suspect) is that Hive appears to store table/partition locations internally as absolute,
fully qualified URLs. Therefore, unless the target cloud cluster is similarly named and configured,
some path transformation step will be needed as part of the synchronisation process.
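
To make the path transformation concrete, here is a minimal sketch of rewriting one fully qualified location so that it points at the target cluster. The host names and the helper rewrite_location are invented for the example; how the locations are read from and written back to the metastore is left out entirely.

```python
from urllib.parse import urlsplit, urlunsplit


def rewrite_location(location, source_authority, target_scheme, target_authority):
    """Point a fully qualified location at the target cluster.

    Locations that do not belong to the source cluster are returned unchanged.
    """
    parts = urlsplit(location)
    if parts.netloc != source_authority:
        return location
    # Keep the path as-is; swap only the scheme and the host:port part.
    return urlunsplit(
        (target_scheme, target_authority, parts.path, parts.query, parts.fragment)
    )


# Hypothetical on-premise and cloud NameNodes.
print(rewrite_location(
    "hdfs://onprem-nn:8020/warehouse/sales/dt=2015-12-17",
    "onprem-nn:8020",
    "hdfs",
    "cloud-nn:8020",
))
```

Parsing the URL rather than doing a plain string replace avoids accidentally rewriting a path that merely contains the source host name somewhere in a directory name.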


I'd appreciate any suggestions, thoughts, or experiences related to this.


Cheers - Elliot.



