hive-issues mailing list archives

From "Elliot West (JIRA)" <>
Subject [jira] [Commented] (HIVE-12285) Add locking to HCatClient
Date Thu, 29 Oct 2015 14:51:27 GMT


Elliot West commented on HIVE-12285:

I've been trying to understand the architecture of HCatalog a little better, specifically
the endpoints, in the hope that I can find one central location at which it'd make sense to
apply locking in an automatic fashion. However, it seems as though different endpoints take
different approaches. I think I have identified the following invocation hierarchies:

* *REST API:* WebHCat → {{hcat}} CLI → {{Hive}} class → metastore
* *Command line:* {{hcat}} CLI → {{Hive}} class → metastore
* *Java client:* {{HCatClient}} → metastore

As far as I can tell, none of the DDL operations in these code paths participate in any locking
code (yet they should). The only common ancestor is the metastore but this seems a risky location
to apply such a change as it is so broadly used and seems to be at the wrong level of abstraction.
It might be possible to introduce a common layer above the metastore and move all endpoints
over to this, but this would be a very large refactoring job.
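
To make the "common layer" idea concrete, here is a minimal sketch of what such a layer might look like. Everything in it is hypothetical: {{TableLockService}}, {{LockingDdlLayer}} and {{RecordingLockService}} are illustrative names, not existing Hive or HCatalog types, and a real layer would need to choose lock types and levels per operation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical stand-in for whatever lock service the shared layer would delegate to. */
interface TableLockService {
    long acquire(String database, String table);
    void release(long lockId);
}

/** Hypothetical shared layer: each endpoint would route DDL through it rather than call the metastore directly. */
final class LockingDdlLayer {
    private final TableLockService locks;

    LockingDdlLayer(TableLockService locks) {
        this.locks = locks;
    }

    /** Acquires a lock, runs the DDL action, and always releases the lock afterwards. */
    <T> T withTableLock(String database, String table, Supplier<T> ddl) {
        long lockId = locks.acquire(database, table);
        try {
            return ddl.get();
        } finally {
            locks.release(lockId);
        }
    }
}

/** In-memory lock service that only records calls; it exists to exercise the sketch. */
final class RecordingLockService implements TableLockService {
    final List<String> events = new ArrayList<>();
    private long nextId = 1;

    @Override
    public long acquire(String database, String table) {
        long id = nextId++;
        events.add("acquire " + database + "." + table + " #" + id);
        return id;
    }

    @Override
    public void release(long lockId) {
        events.add("release #" + lockId);
    }
}
```

The point of the sketch is only that the acquire/release pair lives in one place; moving WebHCat, the {{hcat}} CLI and {{HCatClient}} onto such a layer is the large refactoring mentioned above.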

Finally, I'm slightly confused by the original motivations behind {{HCatClient}}. The project
name {{hive-webhcat-java-client}} suggests that this was perhaps intended to be a Java client
to the REST API, yet all the code in this project seems to bypass the REST API completely.

> Add locking to HCatClient
> -------------------------
>                 Key: HIVE-12285
>                 URL:
>             Project: Hive
>          Issue Type: Improvement
>          Components: HCatalog
>    Affects Versions: 2.0.0
>            Reporter: Elliot West
>            Assignee: Elliot West
>              Labels: concurrency, hcatalog, lock, locking, locks
> With the introduction of a concurrency model (HIVE-1293) Hive uses locks to coordinate
> access and updates to both table data and metadata. Within the Hive CLI such lock management
> is seamless. However, Hive provides additional APIs that permit interaction with data repositories,
> namely the HCatalog APIs. Currently, operations implemented by these APIs do not participate
> in Hive's locking scheme. Furthermore, access to the locking mechanisms is not exposed by
> the APIs (unlike the Metastore Thrift API, which does expose them), so users are not able to
> interact with locks explicitly either. This has created a less than ideal situation where users of the
> APIs have no choice but to manipulate these data repositories outside the control of Hive's
> lock management, potentially resulting in situations where data inconsistencies can occur
> both for external processes using the API and for queries executing within Hive.
> h3. Scope of work
> This ticket is concerned with sections of the HCatalog API that deal with DDL type operations
> using the metastore, not with those whose purpose is to read/write table data. A separate
> issue already exists for adding locking to HCat readers and writers (HIVE-6207).
> h3. Proposed work
> The following work items would serve as a minimum deliverable that would allow API
> users to work effectively with locks:
> * Comprehensively document on the wiki the locks required for various Hive operations.
> At a minimum this should cover all operations exposed by {{HCatClient}}. The [Locking design
> document|] can be used as a starting
> point or perhaps updated.
> * Implement methods and types in the {{HCatClient}} API that allow users to manipulate
> Hive locks. For the most part I'd expect these to delegate to the metastore API implementations:
> ** {{org.apache.hadoop.hive.metastore.IMetaStoreClient.lock(LockRequest)}}
> ** {{org.apache.hadoop.hive.metastore.IMetaStoreClient.checkLock(long)}}
> ** {{org.apache.hadoop.hive.metastore.IMetaStoreClient.unlock(long)}}
> ** -{{org.apache.hadoop.hive.metastore.IMetaStoreClient.showLocks()}}-
> ** {{org.apache.hadoop.hive.metastore.IMetaStoreClient.heartbeat(long, long)}}
> ** {{org.apache.hadoop.hive.metastore.api.LockComponent}}
> ** {{org.apache.hadoop.hive.metastore.api.LockRequest}}
> ** {{org.apache.hadoop.hive.metastore.api.LockResponse}}
> ** {{org.apache.hadoop.hive.metastore.api.LockLevel}}
> ** {{org.apache.hadoop.hive.metastore.api.LockType}}
> ** {{org.apache.hadoop.hive.metastore.api.LockState}}
> ** -{{org.apache.hadoop.hive.metastore.api.ShowLocksResponse}}-
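
The explicit lock lifecycle implied by the list above (request a lock, poll until it is granted, do the work, unlock) can be sketched as follows. This is a sketch under stated assumptions, not the proposed {{HCatClient}} API: {{SketchLockOps}}, {{SketchLockResponse}}, {{SketchLockState}}, {{ExplicitLocking}} and {{FakeLockOps}} are hypothetical, simplified stand-ins that only mirror the shape of the metastore's {{lock}}, {{checkLock}} and {{unlock}} calls, and heartbeating is left out.

```java
import java.util.function.Supplier;

/** Simplified stand-in for the metastore lock states; not the real Thrift enum. */
enum SketchLockState { ACQUIRED, WAITING, NOT_ACQUIRED }

/** Simplified stand-in for a lock response: a lock id plus its current state. */
final class SketchLockResponse {
    final long lockId;
    final SketchLockState state;

    SketchLockResponse(long lockId, SketchLockState state) {
        this.lockId = lockId;
        this.state = state;
    }
}

/** Hypothetical subset of lock operations that a client API might expose. */
interface SketchLockOps {
    SketchLockResponse lock(String database, String table);
    SketchLockResponse checkLock(long lockId);
    void unlock(long lockId);
}

final class ExplicitLocking {
    /** Requests a lock, re-checks it up to maxChecks times, runs the work, then unlocks. */
    static <T> T runLocked(SketchLockOps ops, String database, String table,
                           int maxChecks, Supplier<T> work) {
        SketchLockResponse response = ops.lock(database, table);
        int checks = 0;
        // A real caller would sleep or back off between checks; omitted to keep the sketch short.
        while (response.state == SketchLockState.WAITING && checks < maxChecks) {
            checks++;
            response = ops.checkLock(response.lockId);
        }
        if (response.state != SketchLockState.ACQUIRED) {
            if (response.state == SketchLockState.WAITING) {
                ops.unlock(response.lockId); // withdraw the request that is still queued
            }
            throw new IllegalStateException("Could not lock " + database + "." + table);
        }
        try {
            return work.get();
        } finally {
            ops.unlock(response.lockId);
        }
    }
}

/** In-memory stand-in that reports WAITING a fixed number of times, then ACQUIRED. */
final class FakeLockOps implements SketchLockOps {
    private int waitsRemaining;
    boolean held;

    FakeLockOps(int waits) {
        this.waitsRemaining = waits;
    }

    @Override
    public SketchLockResponse lock(String database, String table) {
        return next();
    }

    @Override
    public SketchLockResponse checkLock(long lockId) {
        return next();
    }

    @Override
    public void unlock(long lockId) {
        held = false;
    }

    private SketchLockResponse next() {
        if (waitsRemaining > 0) {
            waitsRemaining--;
            return new SketchLockResponse(1L, SketchLockState.WAITING);
        }
        held = true;
        return new SketchLockResponse(1L, SketchLockState.ACQUIRED);
    }
}
```

Even in this reduced form the caller has to handle polling, giving up, and releasing on every exit path, which is the burden on API users that the next section discusses.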
> h3. Additional proposals
> Explicit lock management should be fairly simple to add to {{HCatClient}}; however, it
> puts the onus on the API user to correctly understand and implement code that uses locks in
> an appropriate manner. Failure to do so may have undesirable consequences. With a simpler
> user model the operations exposed on the API would automatically acquire and release the locks
> that they need. This might work well for small numbers of operations, but perhaps not for
> large sequences of invocations. (Do we need to worry about this, though, given that the API methods
> usually accept batches?) Additionally, tasks such as heartbeat management could also be handled
> implicitly for long running sets of operations. With these concerns in mind it may also be
> beneficial to deliver some of the following:
> * A means to automatically acquire/release appropriate locks for {{HCatClient}} operations.
> * A component that maintains a lock heartbeat from the client.
> * A strategy for switching between manual/automatic lock management, analogous to SQL's
> {{autocommit}} for transactions.
> An API for lock and heartbeat management already exists in the HCatalog Mutation API
> (see: {{org.apache.hive.hcatalog.streaming.mutate.client.lock}}). It will likely make sense
> to refactor this code, the code that uses it, or both.
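
As an illustration of the "component that maintains a lock heartbeat" item, here is a minimal sketch built on {{java.util.concurrent.ScheduledExecutorService}}. {{HeartbeatCall}} and {{LockHeartbeater}} are hypothetical names, and this is not the existing Mutation API implementation; a real callback might delegate to {{IMetaStoreClient.heartbeat(long, long)}}.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical callback that sends one heartbeat for the given lock. */
interface HeartbeatCall {
    void beat(long lockId);
}

/** Hypothetical component that keeps a lock alive until it is closed. */
final class LockHeartbeater implements AutoCloseable {
    // Daemon thread, so a forgotten heartbeater does not keep the JVM alive.
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(runnable -> {
                Thread thread = new Thread(runnable, "lock-heartbeat");
                thread.setDaemon(true);
                return thread;
            });
    private final ScheduledFuture<?> task;

    LockHeartbeater(HeartbeatCall call, long lockId, long periodMillis) {
        // Note: if beat() throws, scheduleAtFixedRate suppresses all later runs,
        // so a real implementation would need to catch and report failures.
        this.task = scheduler.scheduleAtFixedRate(
                () -> call.beat(lockId), periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    }

    /** Stops sending heartbeats; it does not release the lock itself. */
    @Override
    public void close() {
        task.cancel(false);
        scheduler.shutdownNow();
    }
}
```

Because the class implements {{AutoCloseable}}, a caller could create it in a try-with-resources statement around a long running set of operations, so that heartbeats stop as soon as the block exits.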
