hive-user mailing list archives

From Sushanth Sowmyan <khorg...@gmail.com>
Subject Re: Synchronizing Hive metastores across clusters
Date Thu, 17 Dec 2015 20:03:55 GMT
Also, while I have not yet wiki-ized the documentation for the above, I
have uploaded slides from talks that I've given at the Hive user group
meetup on the subject, along with a doc that describes the replication
protocol followed for the EXIM replication. These are attached over at
https://issues.apache.org/jira/browse/HIVE-10264

On Thu, Dec 17, 2015 at 11:59 AM, Sushanth Sowmyan <khorgath@gmail.com> wrote:
> Hi,
>
> I think that the replication work added with
> https://issues.apache.org/jira/browse/HIVE-7973 is exactly up this
> alley.
>
> Per Eugene's suggestion of MetaStoreEventListener, this replication
> system plugs into that and gets you a stream of notification events
> from HCatClient for the exact purpose you mention.
>
> There's some work still outstanding on this task, most notably
> documentation (sorry!) but please have a look at
> HCatClient.getReplicationTasks(...) and
> org.apache.hive.hcatalog.api.repl.ReplicationTask. You can plug in
> your implementation of ReplicationTask.Factory to inject your own
> logic for how to handle the replication according to your needs.
> (Currently there is an implementation that uses Hive EXPORT/IMPORT
> to perform replication. You can look at the code for this, and the
> tests for these classes, to see how that is achieved. Falcon already
> uses this to perform cross-hive-warehouse replication.)
>
>
> Thanks,
>
> -Sushanth
>
> On Thu, Dec 17, 2015 at 11:22 AM, Eugene Koifman
> <ekoifman@hortonworks.com> wrote:
>> Metastore supports MetaStoreEventListener and MetaStorePreEventListener
>> which may be useful here
>>
>> Eugene
>>
>> From: Elliot West <teabot@gmail.com>
>> Reply-To: "user@hive.apache.org" <user@hive.apache.org>
>> Date: Thursday, December 17, 2015 at 8:21 AM
>> To: "user@hive.apache.org" <user@hive.apache.org>
>> Subject: Synchronizing Hive metastores across clusters
>>
>> Hello,
>>
>> I'm thinking about the steps required to repeatedly push Hive datasets out
>> from a traditional Hadoop cluster into a parallel cloud based cluster. This
>> is not a one-off; it needs to be a constantly running sync process. As new
>> tables and partitions are added in one cluster, they need to be synced to
>> the cloud cluster. Assuming for a moment that I have the HDFS data syncing
>> working, I'm wondering what steps I need to take to reliably ship the
>> HCatalog metadata across. I use HCatalog as the point of truth as to when
>> data is available and where it is located, and so I think that metadata
>> is a critical element to replicate in the cloud based cluster.
>>
>> Does anyone have any recommendations on how to achieve this in practice? One
>> issue (of many I suspect) is that Hive appears to store table/partition
>> locations internally with absolute, fully qualified URLs, therefore unless
>> the target cloud cluster is similarly named and configured some path
>> transformation step will be needed as part of the synchronisation process.
>>
>> I'd appreciate any suggestions, thoughts, or experiences related to this.
>>
>> Cheers - Elliot.
>>
>>
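
The path transformation step raised in the original question could look roughly like the sketch below. It is only an illustration in plain Java, not part of Hive or HCatalog: the class name LocationRewriter, the prefix-mapping approach, and the example hostnames and bucket are all assumptions made for this example. It rewrites a fully qualified table or partition location by swapping a source warehouse root for a target root, and leaves any location outside the source root untouched.

```java
// Hypothetical helper, not a Hive API: maps absolute locations from a
// source warehouse root to a target warehouse root by prefix replacement.
final class LocationRewriter {
    private final String sourceRoot;
    private final String targetRoot;

    LocationRewriter(String sourceRoot, String targetRoot) {
        // Normalize both roots so "…/warehouse" and "…/warehouse/" behave the same.
        this.sourceRoot = stripTrailingSlash(sourceRoot);
        this.targetRoot = stripTrailingSlash(targetRoot);
    }

    // Returns the location re-rooted under targetRoot, or the input
    // unchanged when it does not live under sourceRoot.
    String rewrite(String location) {
        if (location.equals(sourceRoot)) {
            return targetRoot;
        }
        // Require a "/" boundary so "/warehouse2" is not treated as "/warehouse".
        if (location.startsWith(sourceRoot + "/")) {
            return targetRoot + location.substring(sourceRoot.length());
        }
        return location;
    }

    private static String stripTrailingSlash(String s) {
        return s.endsWith("/") ? s.substring(0, s.length() - 1) : s;
    }
}
```

For instance, constructing it with the (made-up) roots "hdfs://onprem-nn:8020/warehouse" and "s3a://example-bucket/warehouse" would map a partition directory under the first root to the same relative path under the second. Whether unmatched locations should pass through unchanged or be rejected depends on how strict the sync process needs to be.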
