hive-issues mailing list archives

From "Sushanth Sowmyan (JIRA)" <>
Subject [jira] [Commented] (HIVE-12285) Add locking to HCatClient
Date Thu, 29 Oct 2015 23:21:27 GMT


Sushanth Sowmyan commented on HIVE-12285:

Reading your comments brought on a sad smile, [~teabot] :)

As with any old project, half-implemented intentions and legacy compatibility eventually
wind up causing issues. Eugene answered most of the specific yes/no aspects of your
questions, so I'll try to ramble on and give some historical context. :)

HCatalog has been a loose collection of API points. The original goal behind HCatalog
was to be a metastore-based storage abstraction layer for all of Hadoop, not just Hive. To
that end, the original architectural goal was for HCatalog to replace Hive's metastore
and StorageHandler subsystems, so that Hive would sit on top of HCatalog, and HCatalog would
sit on top of M/R. In addition, the goal was to add multiple API points to HCatalog so that
products other than Hive, such as Pig or custom MapReduce programs, could share the same backend.

Now, as integration with Hive actually played out, most of the changes made to the HCat
metastore wound up being contributed back into the Hive metastore, which remained a separate
entity. In addition, instead of HCat replacing Hive's StorageHandler, since there was a lot
of disagreement in the community, and since cross-tool compatibility was still a primary goal,
we wound up going the route where HCat plugs in and uses Hive's StorageHandler system (with
a bit of enhancement added along the way). So now, instead of HCat being a common core, it
sits in parallel with Hive, using the same bits Hive does, but in a repeat-implementation
sort of manner, and its primary users are not Hive, but other tools like Pig, custom M/R jobs, etc.

WebHCat was an attempt at a gateway service that let you perform various table management
functions and some minor scheduling, and was intended to act as a secure REST endpoint
that people could use, and so it does what it does. However, given all it tries to do,
I think that as of today, Oozie or Hue might be of more use than WebHCat.

The hcat CLI initially mimicked the Hive CLI, but it additionally performed the task of
being aware of HCat's StorageDriver system alongside traditional IF/OF systems. However,
with HCat adopting Hive's StorageHandler concept, and deprecating and removing
StorageDrivers, the hcat CLI became a thin duplicate of the Hive CLI, except for one thing
it did differently: it allowed easy blocking of non-DDL commands. Thus, the hcat CLI would
always run only pure Hive code, with no user-defined classes ever being loaded. This makes
the hcat CLI more trustworthy in a secure environment as a privileged user. So, despite
thoughts of deprecating and removing the hcat CLI, it lived on for this purpose, and WebHCat
runs the hcat CLI behind the scenes for all its DDL actions, so it lived on in this limited context.

Associated with that, WebHCat was trying to specify a Java client to talk to it; the notion
was to define a "proper" API that would allow a user to connect to WebHCat without needing
any traditional Hive jars on the client side. That specification is what HCatClient
eventually wound up being.

Once the specification was in, the next goal was to come up with a proper client that implemented
HCatClient and talked to WebHCat. However, due to a lack of user interest in such a thing,
the only implementation of HCatClient that came to exist was HMSHCatClient, which was initially
intended only to test the HCatClient interface, not to be its sole implementation.

The HCatClient interface, however, has proven popular enough that it has attracted a lot
of users, and it has been a useful interface for us to recommend to people as well, because
it means external tools like Falcon can use an abstracted interface like HCatClient, rather
than be tied to interfaces like IMetaStoreClient, which we would prefer to keep internal to Hive.

However, (a) we do not have an implementation of HCatClient other than the HMS one, and
(b) the original goal of jar separation never found enough traction.

About two years back, I suggested deprecating the entire webhcat-java-client package, with
a view to replacing it with a top-level hive-api package containing equivalent APIs intended
for public consumption. This was met with some balking from the community, so we have the
sporadic spread of packages and APIs that we currently do.

At this point, I do not think WebHCat itself has very many users, and it could probably
stand being spun off out of Hive, to trim and clean up Hive.

> Add locking to HCatClient
> -------------------------
>                 Key: HIVE-12285
>                 URL:
>             Project: Hive
>          Issue Type: Improvement
>          Components: HCatalog
>    Affects Versions: 2.0.0
>            Reporter: Elliot West
>            Assignee: Elliot West
>              Labels: concurrency, hcatalog, lock, locking, locks
> With the introduction of a concurrency model (HIVE-1293), Hive uses locks to coordinate
 access and updates to both table data and metadata. Within the Hive CLI such lock management
is seamless. However, Hive provides additional APIs that permit interaction with data repositories,
namely the HCatalog APIs. Currently, operations implemented by this API do not participate
with Hive's locking scheme. Furthermore, access to the locking mechanisms is not exposed by
the APIs (as is the case with the Metastore Thrift API) and so users are not able to explicitly
interact with locks either. This has created a less than ideal situation where users of the
APIs have no choice but to manipulate these data repositories outside of the command of Hive's
lock management, potentially resulting in situations where data inconsistencies can occur
both for external processes using the API and for queries executing within Hive.
> h3. Scope of work
> This ticket is concerned with sections of the HCatalog API that deal with DDL type operations
using the metastore, not with those whose purpose is to read/write table data. A separate
issue already exists for adding locking to HCat readers and writers (HIVE-6207).
> h3. Proposed work
> The following work items would serve as a minimum deliverable that would allow API
users to effectively work with locks:
> * Comprehensively document on the wiki the locks required for various Hive operations.
At a minimum this should cover all operations exposed by {{HCatClient}}. The [Locking design
document|] can be used as a starting
point or perhaps updated.
> * Implement methods and types in the {{HCatClient}} API that allow users to manipulate
Hive locks. For the most part I'd expect these to delegate to the metastore API implementations:
> ** {{org.apache.hadoop.hive.metastore.IMetaStoreClient.lock(LockRequest)}}
> ** {{org.apache.hadoop.hive.metastore.IMetaStoreClient.checkLock(long)}}
> ** {{org.apache.hadoop.hive.metastore.IMetaStoreClient.unlock(long)}}
> ** -{{org.apache.hadoop.hive.metastore.IMetaStoreClient.showLocks()}}-
> ** {{org.apache.hadoop.hive.metastore.IMetaStoreClient.heartbeat(long, long)}}
> ** {{org.apache.hadoop.hive.metastore.api.LockComponent}}
> ** {{org.apache.hadoop.hive.metastore.api.LockRequest}}
> ** {{org.apache.hadoop.hive.metastore.api.LockResponse}}
> ** {{org.apache.hadoop.hive.metastore.api.LockLevel}}
> ** {{org.apache.hadoop.hive.metastore.api.LockType}}
> ** {{org.apache.hadoop.hive.metastore.api.LockState}}
> ** -{{org.apache.hadoop.hive.metastore.api.ShowLocksResponse}}-
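The delegation pattern proposed above can be sketched as follows. This is a hypothetical illustration, not actual Hive API: the `LockingMetastore` interface merely mirrors the shape of the `IMetaStoreClient` lock calls listed above (which really take/return Thrift types such as `LockRequest` and `LockResponse`), and `HCatLockFacade`, its method names, and the in-memory implementation are invented here so the sketch is self-contained and runnable:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in mirroring the IMetaStoreClient lock calls; the real
// client speaks Thrift to the metastore and uses LockRequest/LockResponse.
interface LockingMetastore {
    long lock(String resource);              // ~ IMetaStoreClient.lock(LockRequest)
    String checkLock(long lockId);           // ~ IMetaStoreClient.checkLock(long)
    void unlock(long lockId);                // ~ IMetaStoreClient.unlock(long)
    void heartbeat(long txnId, long lockId); // ~ IMetaStoreClient.heartbeat(long, long)
}

// Sketch of how HCatClient-style lock methods could simply delegate.
class HCatLockFacade {
    private final LockingMetastore metastore;

    HCatLockFacade(LockingMetastore metastore) { this.metastore = metastore; }

    long acquireLock(String resource) { return metastore.lock(resource); }
    String lockState(long lockId)     { return metastore.checkLock(lockId); }
    void releaseLock(long lockId)     { metastore.unlock(lockId); }
    void keepAlive(long lockId)       { metastore.heartbeat(0L, lockId); }
}

// In-memory implementation so the sketch runs without a metastore service.
class InMemoryLockingMetastore implements LockingMetastore {
    private final Map<Long, String> locks = new HashMap<>();
    private final AtomicLong ids = new AtomicLong();

    public long lock(String resource) {
        long id = ids.incrementAndGet();
        locks.put(id, resource);   // real metastore may return WAITING first
        return id;
    }
    public String checkLock(long lockId) {
        return locks.containsKey(lockId) ? "ACQUIRED" : "RELEASED";
    }
    public void unlock(long lockId) { locks.remove(lockId); }
    public void heartbeat(long txnId, long lockId) { /* would renew the lease */ }
}
```

The point of the facade is that API users never see Thrift types directly, preserving the jar-separation intent described earlier in the thread.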
> h3. Additional proposals
> Explicit lock management should be fairly simple to add to {{HCatClient}}; however, it
puts the onus on the API user to correctly understand and write code that uses locks in
an appropriate manner, and failure to do so may have undesirable consequences. With a simpler
user model, the operations exposed by the API would automatically acquire and release the
locks that they need. This might work well for small numbers of operations, but perhaps not
for large sequences of invocations. (Do we need to worry about this though, as the API methods
usually accept batches?) Additionally, tasks such as heartbeat management could be handled
implicitly for long-running sets of operations. With these concerns in mind it may also be
beneficial to deliver some of the following:
> * A means to automatically acquire/release appropriate locks for {{HCatClient}} operations.
> * A component that maintains a lock heartbeat from the client.
> * A strategy for switching between manual/automatic lock management, analogous to SQL's
{{autocommit}} for transactions.
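The client-side heartbeat component proposed above could be sketched with a scheduled executor. Everything here is a hypothetical illustration: the `LockHeartbeater` name and constructor are invented, and the beat is an arbitrary `Runnable` rather than a real `IMetaStoreClient.heartbeat(txnId, lockId)` call, so the sketch stays self-contained:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a client-side heartbeat maintainer for a held lock.
// In real use the Runnable would invoke IMetaStoreClient.heartbeat(txnId, lockId).
class LockHeartbeater implements AutoCloseable {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "lock-heartbeat");
                t.setDaemon(true); // don't keep the JVM alive just for heartbeats
                return t;
            });
    private final ScheduledFuture<?> task;

    LockHeartbeater(Runnable sendHeartbeat, long periodMillis) {
        // Beat at a fixed rate chosen to be well inside the server-side
        // lock timeout, so the lock is never reaped while work is ongoing.
        this.task = scheduler.scheduleAtFixedRate(
                sendHeartbeat, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    }

    @Override
    public void close() {          // stop beating once the lock is released
        task.cancel(false);
        scheduler.shutdown();
    }
}
```

Making it `AutoCloseable` lets callers scope the heartbeat to a try-with-resources block around the locked work, which also fits the autocommit-style switching idea: automatic mode would create and close one of these internally per operation.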
> An API for lock and heartbeat management already exists in the HCatalog Mutation API
(see: {{org.apache.hive.hcatalog.streaming.mutate.client.lock}}). It will likely make sense
to refactor this code and/or the code that uses it.

This message was sent by Atlassian JIRA
