hive-user mailing list archives

From Sushanth Sowmyan <>
Subject Re: Synchronizing Hive metastores across clusters
Date Thu, 17 Dec 2015 19:59:57 GMT

I think that the replication work added to HCatalog is exactly up this alley.

Per Eugene's suggestion of MetaStoreEventListener, this replication
system plugs into that and gets you a stream of notification events
from HCatClient for the exact purpose you mention.

Some work is still outstanding on this task, most notably
documentation (sorry!), but please have a look at
HCatClient.getReplicationTasks(...) and
org.apache.hive.hcatalog.api.repl.ReplicationTask. You can plug in
your own implementation of ReplicationTask.Factory to control how
replication is handled for your needs. There is currently an
implementation that uses Hive EXPORT/IMPORT to perform replication;
the code and the tests for these classes show how that is achieved.
Falcon already uses this to perform cross-hive-warehouse replication.
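
As a rough illustration of the shape of that plug-in point, here is a
minimal, dependency-free Java sketch. Everything in it (Event, Task,
TaskFactory, SyncOnAddFactory, drain, and the "sync ..." command
strings) is a hypothetical stand-in made up for this example. These are
not the Hive classes, and the real signatures of
HCatClient.getReplicationTasks(...) and ReplicationTask.Factory should
be taken from the Hive source and tests.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative stand-ins only. None of these types are Hive classes.
final class ReplicationSketch {

    /** Stand-in for one metastore notification event. */
    static final class Event {
        final long id;
        final String type;
        final String table;

        Event(long id, String type, String table) {
            this.id = id;
            this.type = type;
            this.table = table;
        }
    }

    /** Stand-in for a unit of replication work derived from one event. */
    interface Task {
        boolean isActionable();
        List<String> targetCommands();
    }

    /** Stand-in for the pluggable factory that maps an event to a task. */
    interface TaskFactory {
        Task create(Event event);
    }

    /** Example policy (an assumption): only additions produce work. */
    static final class SyncOnAddFactory implements TaskFactory {
        @Override
        public Task create(final Event event) {
            final boolean actionable = event.type.equals("CREATE_TABLE")
                    || event.type.equals("ADD_PARTITION");
            return new Task() {
                @Override
                public boolean isActionable() {
                    return actionable;
                }

                @Override
                public List<String> targetCommands() {
                    if (!actionable) {
                        return Collections.emptyList();
                    }
                    // Hypothetical command text, not real Hive syntax.
                    return Collections.singletonList(
                            "sync " + event.table + " (" + event.type + ")");
                }
            };
        }
    }

    /**
     * Walks events in order, skips those at or before lastEventId, collects
     * the commands of actionable tasks, and returns the id of the last event
     * seen so that the next poll can resume from there.
     */
    static long drain(List<Event> events, long lastEventId,
                      TaskFactory factory, List<String> commandsOut) {
        long last = lastEventId;
        for (Event event : events) {
            if (event.id <= last) {
                continue; // already handled in an earlier poll
            }
            Task task = factory.create(event);
            if (task.isActionable()) {
                commandsOut.addAll(task.targetCommands());
            }
            last = event.id;
        }
        return last;
    }
}
```

The point of the factory indirection is that the polling loop stays the
same while the policy (EXPORT/IMPORT, a metadata-only copy, or something
else) is swapped by supplying a different factory. Persisting the
returned event id between runs is what makes the sync resumable.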



On Thu, Dec 17, 2015 at 11:22 AM, Eugene Koifman wrote:
> Metastore supports MetaStoreEventListener and MetaStorePreEventListener
> which may be useful here
> Eugene
> From: Elliot West
> Date: Thursday, December 17, 2015 at 8:21 AM
> Subject: Synchronizing Hive metastores across clusters
> Hello,
> I'm thinking about the steps required to repeatedly push Hive datasets out
> from a traditional Hadoop cluster into a parallel cloud based cluster. This
> is not a one off, it needs to be a constantly running sync process. As new
> tables and partitions are added in one cluster, they need to be synced to
> the cloud cluster. Assuming for a moment that I have the HDFS data syncing
> working, I'm wondering what steps I need to take to reliably ship the
> HCatalog metadata across. I use HCatalog as the point of truth as to
> when data is available and where it is located and so I think that metadata
> is a critical element to replicate in the cloud based cluster.
> Does anyone have any recommendations on how to achieve this in practice? One
> issue (of many I suspect) is that Hive appears to store table/partition
> locations internally with absolute, fully qualified URLs, therefore unless
> the target cloud cluster is similarly named and configured some path
> transformation step will be needed as part of the synchronisation process.
> I'd appreciate any suggestions, thoughts, or experiences related to this.
> Cheers - Elliot.
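
On the path transformation step raised in the original question, a
minimal sketch of one possible approach is a plain prefix rewrite of
each table or partition location before it is registered in the target
metastore. LocationRewriter is a hypothetical helper, not part of Hive,
and the host and bucket names in the usage note are made up. The sketch
assumes every location is written with exactly the same scheme, host,
and port as the configured source root; it does no URI normalization.

```java
// Hypothetical helper, not part of Hive: re-roots fully qualified locations.
final class LocationRewriter {
    private final String sourceRoot;
    private final String targetRoot;

    LocationRewriter(String sourceRoot, String targetRoot) {
        this.sourceRoot = stripTrailingSlashes(sourceRoot);
        this.targetRoot = stripTrailingSlashes(targetRoot);
    }

    /**
     * Returns the location re-rooted under targetRoot when it is the source
     * root itself or lies beneath it; any other location is returned
     * unchanged.
     */
    String rewrite(String location) {
        if (location.equals(sourceRoot)) {
            return targetRoot;
        }
        // Require a path separator after the root so that a sibling such as
        // ".../warehouse_old" is not mistaken for a child of ".../warehouse".
        if (location.startsWith(sourceRoot + "/")) {
            return targetRoot + location.substring(sourceRoot.length());
        }
        return location;
    }

    private static String stripTrailingSlashes(String s) {
        int end = s.length();
        while (end > 0 && s.charAt(end - 1) == '/') {
            end--;
        }
        return s.substring(0, end);
    }
}
```

For example, with a source root of
hdfs://onprem-nn:8020/user/hive/warehouse and a target root of
s3a://example-bucket/warehouse, the partition location
hdfs://onprem-nn:8020/user/hive/warehouse/sales.db/orders/ds=2015-12-17
would be rewritten to
s3a://example-bucket/warehouse/sales.db/orders/ds=2015-12-17, while
locations outside the source root are left alone.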
