hive-user mailing list archives

From "Larson, Kurt" <>
Subject RE: Hive federation service
Date Thu, 27 Jul 2017 15:40:03 GMT
Hi Carter and Elliot,

First off:

Carter, as the JDBC endpoint is serviced by the HiveServer2 service and not the Hive Metastore
Service (HMS), I’d assume that the answer to your question is no and that you’d still
need your own HiveServer2 to interact with the Waggle-Dance HMS proxy to process your JDBC
API requests.

Waggle-Dance question:

As the Waggle-Dance diagram shows only the HMS thrift API being federated, how do consumers
access the data that the LOCATION properties of the Hive database objects point to? It seems
that Waggle-Dance goes to great lengths to navigate the network topology to get from the proxy
to the remote HMSs. However, there’s no mention of where the data is stored. Clearly, if
all the remote HMSs store their data in a common service, like AWS S3 or Azure Blob Storage,
it will be easier for the HMS proxy consumers to access it, but there may still be configuration
challenges with multiple accounts and different permissions and roles. If each remote HMS
stores its data in a separate local distributed file system, like an HDFS cluster, or in a mix
of the two, there are additional network topology challenges similar to those of reaching the
HMSs themselves. Is there any solution or consideration for federated data access?
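
To make the concern concrete, here is a minimal Python sketch that groups table LOCATION
URIs by storage scheme and authority, which gives a rough list of the storage systems a
consumer of the federated metastore would need to reach. The table names and URIs are
hypothetical, and the dictionary is assumed to have been collected from the metastores
beforehand; nothing here is part of Waggle-Dance itself.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_storage(locations):
    """Group table names by the (scheme, authority) of their LOCATION URI."""
    groups = defaultdict(list)
    for table, uri in locations.items():
        parts = urlsplit(uri)
        # scheme is e.g. "s3" or "hdfs"; netloc is the bucket or namenode host:port
        groups[(parts.scheme, parts.netloc)].append(table)
    return dict(groups)


# Hypothetical tables and locations, for illustration only.
locations = {
    "sales.orders": "s3://example-bucket-a/warehouse/orders",
    "sales.refunds": "s3://example-bucket-a/warehouse/refunds",
    "ops.events": "hdfs://namenode-b.example.internal:8020/warehouse/events",
}

storage = group_by_storage(locations)
for (scheme, authority), tables in storage.items():
    print(scheme, authority, tables)
```

Each distinct (scheme, authority) pair is a storage endpoint that needs its own network
route and credentials, independent of how the metastore requests themselves are routed.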


From: Carter Shanklin []
Sent: Thursday, July 27, 2017 10:57 AM
Subject: Re: Hive federation service


Interesting stuff

I have 3 questions
1. Can Waggle Dance deal with multiple kerberized Hadoop clusters?
2. Do you support 3 layers in the hierarchy (i.e. cluster.database.table) or 2 layers, with
a requirement to avoid any possible name collisions in the mapping layer?
3. Is it compatible with JDBC? It wasn't clear to me since the diagrams all mention thrift.
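
For question 2, the two-layer option could look roughly like the Python sketch below, in
which a per-metastore prefix is folded into the database name so that equally named databases
in different deployments do not collide. This is a hypothetical illustration of the idea
only: the function name, the prefixes, and the metastore labels are all assumptions, and it
is not a description of how Waggle Dance actually implements its mapping.

```python
def resolve(database, prefixes, primary):
    """Map a client-visible database name to (metastore, remote database name).

    prefixes: dict of prefix -> metastore label (hypothetical mapping layer).
    primary:  metastore label used when no prefix matches.
    """
    # Try longer prefixes first so "remote_b_eu_" wins over "remote_b_".
    for prefix in sorted(prefixes, key=len, reverse=True):
        if database.startswith(prefix):
            return prefixes[prefix], database[len(prefix):]
    return primary, database


# Hypothetical mapping, for illustration only.
prefixes = {"remote_b_": "metastore-b", "remote_b_eu_": "metastore-b-eu"}

print(resolve("remote_b_sales", prefixes, "metastore-a"))
print(resolve("remote_b_eu_sales", prefixes, "metastore-a"))
print(resolve("sales", prefixes, "metastore-a"))
```

With this scheme the client still addresses tables as database.table, and the prefix carries
the information that a third "cluster" layer would otherwise hold.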


From: Elliot West <<>>
Reply-To: "<>" <<>>
Date: Thursday, July 27, 2017 at 06:21
To: "<>" <<>>
Subject: Hive federation service


We've recently contributed our Hive federation service to the open source community:

Waggle Dance is a request-routing Hive metastore proxy that allows tables to be concurrently
accessed across multiple Hive deployments. It was created to tackle the dataset silos that
appeared as our large organization gradually migrated from monolithic on-premises clusters
to cloud-based platforms.

In short, Waggle Dance enables a unified end point with which you can describe, query, and
join tables that may exist in multiple distinct Hive deployments. Such deployments may exist
in disparate regions, accounts, or clouds (security and network permitting). Dataset access
is not limited to the Hive query engine, and should work with any platform that uses the Hive
metastore. We've been successfully using it with Spark, for example.

More recently, we've employed Waggle Dance to apply a simple security layer to cloud-based
platforms such as Qubole, Databricks, and EMR. These currently provide no means to construct
cross-platform authentication and authorization strategies. Therefore, we use a combination
of Waggle Dance and network configuration to restrict writes and destructive Hive operations
to specific user groups and applications.

We currently operate many disparate Hive metastore instances whose tables must be shared across
the organization. Therefore we are committed to the ongoing development of this project. However,
should such federation features have broader appeal, we'd be keen to see similar features
integrated into Hive, perhaps in a more accessible form echoing existing remote table or database
link features present in some traditional RDBMSes.

All feedback appreciated, many thanks for your time,
