hive-user mailing list archives

From Sushanth Sowmyan <>
Subject Re: Synchronizing Hive metastores across clusters
Date Thu, 17 Dec 2015 20:03:55 GMT
Also, while I have not yet wiki-ized the documentation for the above, I
have uploaded slides from talks that I've given at Hive user group
meetups on the subject, and also a doc that describes the replication
protocol followed for EXIM replication. These are attached over at

On Thu, Dec 17, 2015 at 11:59 AM, Sushanth Sowmyan <> wrote:
> Hi,
> I think that the replication work added with
> is exactly up this
> alley.
> Per Eugene's suggestion of MetaStoreEventListener, this replication
> system plugs into that and gets you a stream of notification events
> from HCatClient for the exact purpose you mention.
> There's some work still outstanding on this task, most notably
> documentation (sorry!) but please have a look at
> HCatClient.getReplicationTasks(...) and
> org.apache.hive.hcatalog.api.repl.ReplicationTask. You can plug in
> your implementation of  ReplicationTask.Factory to inject your own
> logic for how to handle the replication according to your needs.
> (Currently there is an implementation that uses Hive EXPORT/IMPORT
> to perform replication. You can look at the code for this, and the
> tests for these classes, to see how that is achieved. Falcon already
> uses this to perform cross-hive-warehouse replication.)
> Thanks,
> -Sushanth
> On Thu, Dec 17, 2015 at 11:22 AM, Eugene Koifman
> <> wrote:
>> Metastore supports MetaStoreEventListener and MetaStorePreEventListener
>> which may be useful here
>> Eugene
>> From: Elliot West <>
>> Reply-To: "" <>
>> Date: Thursday, December 17, 2015 at 8:21 AM
>> To: "" <>
>> Subject: Synchronizing Hive metastores across clusters
>> Hello,
>> I'm thinking about the steps required to repeatedly push Hive datasets out
>> from a traditional Hadoop cluster into a parallel cloud-based cluster. This
>> is not a one-off; it needs to be a constantly running sync process. As new
>> tables and partitions are added in one cluster, they need to be synced to
>> the cloud cluster. Assuming for a moment that I have the HDFS data syncing
>> working, I'm wondering what steps I need to take to reliably ship the
>> HCatalog metadata across. I use HCatalog as the point of truth as to
>> when data is available and where it is located, and so I think that metadata
>> is a critical element to replicate in the cloud-based cluster.
>> Does anyone have any recommendations on how to achieve this in practice? One
>> issue (of many, I suspect) is that Hive appears to store table/partition
>> locations internally as absolute, fully qualified URLs. Therefore, unless
>> the target cloud cluster is similarly named and configured, some path
>> transformation step will be needed as part of the synchronisation process.
>> I'd appreciate any suggestions, thoughts, or experiences related to this.
>> Cheers - Elliot.
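
For illustration, the path transformation step raised in the original question could look roughly like the sketch below. It is not part of Hive or HCatalog: the class name LocationRewriter, the namenode host and the bucket name are all hypothetical, and it assumes a location only needs remapping when it sits under a single known source warehouse root. Anything outside that root is returned unchanged.

```java
import java.net.URI;
import java.util.Objects;

// Hypothetical helper (not a Hive/HCatalog class): remaps fully qualified
// table/partition locations from a source warehouse root to a target root.
final class LocationRewriter {
    private final URI sourceRoot;
    private final String targetRoot;

    LocationRewriter(String sourceRoot, String targetRoot) {
        this.sourceRoot = URI.create(sourceRoot);
        this.targetRoot = stripTrailingSlash(targetRoot);
    }

    // Returns the location re-rooted under targetRoot, or the input unchanged
    // if it does not live under sourceRoot.
    String rewrite(String location) {
        URI loc = URI.create(location);
        String path = loc.getPath();
        if (path == null
                || !Objects.equals(loc.getScheme(), sourceRoot.getScheme())
                || !Objects.equals(loc.getAuthority(), sourceRoot.getAuthority())) {
            return location;
        }
        String srcPath = stripTrailingSlash(sourceRoot.getPath());
        // Match on whole path segments so "/warehouse2" is not treated as
        // being under "/warehouse".
        if (!path.equals(srcPath) && !path.startsWith(srcPath + "/")) {
            return location;
        }
        return targetRoot + path.substring(srcPath.length());
    }

    private static String stripTrailingSlash(String s) {
        return s.endsWith("/") ? s.substring(0, s.length() - 1) : s;
    }

    public static void main(String[] args) {
        // Assumed example roots; substitute the real cluster URIs.
        LocationRewriter rewriter = new LocationRewriter(
                "hdfs://onprem-nn:8020/user/hive/warehouse",
                "s3a://example-bucket/warehouse");
        System.out.println(rewriter.rewrite(
                "hdfs://onprem-nn:8020/user/hive/warehouse/sales.db/orders/dt=2015-12-17"));
    }
}
```

A step like this would be applied to each table or partition location before the metadata is written to the target metastore, whichever replication mechanism is used to ship the metadata itself.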
