drill-issues mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5510) Revisit connection failure recovery in Hive storage plugin
Date Mon, 15 May 2017 17:54:04 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16010989#comment-16010989 ]

Paul Rogers commented on DRILL-5510:

More details. The Hive client in the Hive storage plugin is not designed to handle security.

* When we start the Hive storage plugin, we create a single instance of the {{HiveSchemaFactory}}.
* {{HiveSchemaFactory}} holds on to a {{DrillHiveMetaStoreClient}} connection. In the secure
case, this connection is used to get security certificates for use in creating secure connections.
* {{HiveSchemaFactory}} has a Guava loading cache of user-specific, secure connections.

When the Hive metastore goes down, all connections become invalid, including the non-secure
connection and all the secure connections. We try to handle the problem as follows.

If a secure connection times out:

* Use the (now-invalid) insecure connection to get another ticket. But, since that connection is
no longer valid, we can't reconnect and so always fail.

If we try to use a cached secure connection before timeout, then this happens:

* Try to send a message.
* When that fails, try to reconnect (using the old certificate for the prior session).
* When that fails, give up.

What we really need to do is:

* Recreate both the insecure *and* secure connections.

But, since the secure connection cache is held on the insecure connection, we can't easily
recreate that connection: we'd get a new object.

So, we have to make some changes.

* Hold the secure connection cache on an object other than a connection.
* Use a connection proxy instead of the connection as key to the cache. The proxy allows maintaining
the cache entry, but replacing the secure connection with a new one. (The proxy is just a
wrapper around a replaceable secure connection.)
* Similarly, provide a thread-safe way to reconnect the non-secure connection used to get
tickets for the secure connection.
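
A rough sketch of the proxy idea, not the actual Drill code. All names here ({{MetaClient}}, {{ClientProxy}}, {{ClientCall}}, {{ClientRegistry}}) are hypothetical, and the sketch assumes the cache is keyed by user name with the proxy as the cached entry. The point is only that the cache entry stays the same object while the connection behind it is replaced:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical stand-in for a metastore client; not DrillHiveMetaStoreClient.
interface MetaClient {
  String listTables(String db) throws Exception;
}

// Hypothetical callback type for one request against a client.
interface ClientCall<T> {
  T apply(MetaClient client) throws Exception;
}

// Wrapper around a replaceable connection. The cache holds the proxy, so the
// cache entry survives when the underlying connection is swapped out.
final class ClientProxy {
  private final Supplier<MetaClient> factory;
  private MetaClient delegate;

  ClientProxy(Supplier<MetaClient> factory) {
    this.factory = factory;
    this.delegate = factory.get();
  }

  // Run a request; on failure, replace the connection once and retry.
  // synchronized keeps the swap thread-safe at the cost of serializing
  // requests that share this proxy.
  synchronized <T> T call(ClientCall<T> op) throws Exception {
    try {
      return op.apply(delegate);
    } catch (Exception first) {
      delegate = factory.get();  // new connection, same proxy object
      return op.apply(delegate);
    }
  }
}

// Holds the per-user cache on an object that is not itself a connection.
final class ClientRegistry {
  private final ConcurrentMap<String, ClientProxy> byUser = new ConcurrentHashMap<>();
  private final Function<String, MetaClient> connect;

  ClientRegistry(Function<String, MetaClient> connect) {
    this.connect = connect;
  }

  ClientProxy forUser(String user) {
    return byUser.computeIfAbsent(user, u -> new ClientProxy(() -> connect.apply(u)));
  }
}
```

In a real fix, the {{connect}} function would presumably first refresh the non-secure connection and obtain a new ticket before opening the secure connection; that step is omitted here.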

All this is not a huge project, but it is more than can be done in the context of a simple bug
fix for DRILL-5496. So, for that ticket, I used a hack: just throw away the entire schema
builder and create a new one. But, that solution requires synchronizing all requests and is
far from ideal. This ticket is a request to create a better long-term solution.
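
For contrast, the shape of the DRILL-5496 work-around can be sketched roughly as below. This is an illustration under assumed names ({{SchemaState}}, {{StateCall}}, {{RebuildingHolder}} are hypothetical), not the code committed for that ticket. It shows the two costs mentioned above: one lock around every request, and loss of everything cached when the state is rebuilt:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for the schema builder and everything it caches.
final class SchemaState {
  final Map<String, String> cache = new HashMap<>();
}

// Hypothetical callback type for one request against the current state.
interface StateCall<T> {
  T apply(SchemaState state) throws Exception;
}

final class RebuildingHolder {
  private final Supplier<SchemaState> builder;
  private SchemaState state;
  private int rebuilds;

  RebuildingHolder(Supplier<SchemaState> builder) {
    this.builder = builder;
    this.state = builder.get();
  }

  // One lock guards every request, so requests from all users run one at a time.
  synchronized <T> T run(StateCall<T> op) throws Exception {
    try {
      return op.apply(state);
    } catch (Exception e) {
      state = builder.get();  // discard the old state, cached entries included
      rebuilds++;
      return op.apply(state);
    }
  }

  synchronized int rebuilds() {
    return rebuilds;
  }
}
```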

> Revisit connection failure recovery in Hive storage plugin
> ----------------------------------------------------------
>                 Key: DRILL-5510
>                 URL: https://issues.apache.org/jira/browse/DRILL-5510
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.11.0
>            Reporter: Paul Rogers
> DRILL-5496 describes a problem which occurs when the Hive metastore server is restarted
> while Drill runs. The solution in that ticket is a work-around: we discard all cached Hive
> metastore data and rebuild the metadata cache.
> The original code tried to be more subtle: detect that the connection has failed,
> reconnect, but preserve the cache. DRILL-5496 describes the flaws in that approach for the
> secure connection case.
> This ticket asks to spend the time to understand the Hive metadata code and restructure
> it to preserve the cache across connection failures.
> Note a subtle issue: if the Hive metastore goes down, when it comes back up, it may contain
> different data; anything could happen while the server is down: upgrade schemas, replace one
> schema with another, etc. So, the caching mechanism, if it is to preserve data across reconnects,
> must handle such changes.
> Of course, such changes could occur even within a single connection, so the code should
> handle such cases already.
