pig-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Pig Wiki] Update of "Howl/HowlAuthentication" by AlanGates
Date Mon, 06 Dec 2010 21:41:22 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The "Howl/HowlAuthentication" page has been changed by AlanGates.
http://wiki.apache.org/pig/Howl/HowlAuthentication

--------------------------------------------------

New page:
This page lists use cases for authentication related to Howl, and attempts to outline
the changes required to enable those use cases.

== Background and terminology ==

The Hadoop Security (!HadoopS) release uses Kerberos to provide authentication. On a secure
cluster, the cluster servers (namenode (nn), jobtracker (jt), datanode, tasktracker) are themselves
Kerberos service principals, end users are user principals, and users and these services
mutually authenticate to each other using Kerberos tickets. !HadoopS uses security tokens
called "delegation tokens" (these are NOT Kerberos tickets but a Hadoop-specific security
token) to authenticate the map/reduce tasks. So at job submission time, once the job client
has presented the user's Kerberos ticket to authenticate to the namenode and jobtracker, it is
handed delegation tokens from the namenode so that the tasks can use them to talk to the
namenode. These delegation tokens are stored in the "credential store" for the job, and the
jobtracker automatically renews them for the job up to a maximum lifetime of 7 days.

=== Oozie use case ===
Oozie is a service which users use to submit jobs to the !HadoopS cluster. It somewhat resembles
the Howl server, since the Howl server also needs to act on behalf of users while accessing
the !DFS. Users authenticate to oozie, and the oozie service then acts on behalf of the user
when working with the jobtracker or namenode. For this to work, both the namenode and jobtracker
need to recognize the "oozie" principal as a "proxy user" principal (i.e. a principal that
can act on behalf of other users). In addition, the namenode and jobtracker need to know the possible
IPs for the proxy user service and the list of users or groups (i.e. all users belonging to the group
would be allowed) on whose behalf the oozie principal can act. This proxy user list and
associated information is maintained in a configuration read by the namenode and jobtracker.
Once the user authenticates to oozie, oozie authenticates itself to the nn/jt using the oozie
principal and also uses !UserGroupInformation.doAs() to obtain a !JobClient object associated
with the real user (it needs the real username for the doAs(), which it gets hold of from the
user authentication). Through this process, oozie adds delegation tokens (actually the !JobClient
code does this in a subsequent submitJob()) for the jt and primary nn into the new !JobClient,
to pass on to the launcher map task for the Pig/MR job. If the Pig script/MR job needs
to access more than the primary namenode, an oozie parameter should be used to specify the
list of nns that need to be accessed, and oozie will get delegation tokens for all of them
through the jobclient.
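On the nn/jt side, this proxy-user whitelist lives in Hadoop configuration. A sketch of what it might look like for oozie (the `hadoop.proxyuser.*` property-name convention is Hadoop Security's; the group name and host values here are made up for illustration):

```xml
<!-- core-site.xml on the namenode and jobtracker (illustrative values) -->
<property>
  <name>hadoop.proxyuser.oozie.groups</name>
  <!-- users in these groups may be impersonated by the oozie principal -->
  <value>oozie-users</value>
</property>
<property>
  <name>hadoop.proxyuser.oozie.hosts</name>
  <!-- hosts/IPs from which the oozie service may act as a proxy user -->
  <value>10.0.0.1,10.0.0.2</value>
</property>
```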
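The oozie flow above can be sketched as follows. This is Java-flavored pseudocode: !UserGroupInformation.createProxyUser()/doAs() and !JobClient are real !HadoopS APIs, but the surrounding glue (variable names, conf) is illustrative and not runnable outside a secured cluster:

```java
// Sketch only: oozie acting on behalf of "realUser" (not oozie's actual code).
// Oozie has already authenticated to the nn/jt as the "oozie" Kerberos principal.
UserGroupInformation proxyUgi =
    UserGroupInformation.createProxyUser(realUser, UserGroupInformation.getLoginUser());

JobClient jobClient = proxyUgi.doAs(new PrivilegedExceptionAction<JobClient>() {
  public JobClient run() throws IOException {
    // Created inside doAs(), so the JobClient is bound to the real user;
    // a later submitJob() pulls jt/nn delegation tokens into the job's credentials.
    return new JobClient(new JobConf(conf));
  }
});
```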

== Changes required in Howl ==
   * Howl server will need to run as a proxy user principal. So at deployment time, the configuration
of the nn and jt will need to be updated to recognize the "howl" principal as a "proxy user"
principal. A "howl" net group (similar to oozie) will be needed, and all users who want to
use Howl will need to add themselves to the "howl" group.
   * Howl server will also need to hand out delegation tokens (like the nn) so that the output
committer task can use them to authenticate to the Howl server to "publish" partitions. Apart
from the output committer, oozie will also request Howl delegation tokens and hand them to
the corresponding Pig/mapred jobs.
   * End users of Howl using Pig/Hive/Map Reduce/the Howl cli (and not using oozie) would authenticate
to Howl using Kerberos tickets in the thrift api calls. As noted in the point above, the output
committer task would authenticate to the Howl server using the Howl delegation token in the
publish_partition api call. So the thrift calls need to support both Kerberos based and delegation
token based authentication. '''There should be a property which is honored to run the metastore
without any authentication; preferably this should be the same property that Hadoop uses for
non-secure operation.'''
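The three-way requirement above (no auth, Kerberos, or delegation token) reduces to a small selection rule. A self-contained sketch: the property name `hadoop.security.authentication` and its `simple`/`kerberos` values are Hadoop's, but the `AuthMode` names and `chooseAuthMode` helper are invented for illustration:

```java
import java.util.Map;

public class AuthModeSketch {
  enum AuthMode { NONE, KERBEROS, DELEGATION_TOKEN }

  // Hypothetical selection rule for an incoming thrift connection:
  // - if the cluster runs unsecured ("simple"), skip authentication entirely;
  // - otherwise use a delegation token when the client presents one
  //   (e.g. the output committer task), and Kerberos when it does not.
  static AuthMode chooseAuthMode(Map<String, String> conf, boolean clientHasToken) {
    String auth = conf.getOrDefault("hadoop.security.authentication", "simple");
    if (!auth.equals("kerberos")) {
      return AuthMode.NONE;
    }
    return clientHasToken ? AuthMode.DELEGATION_TOKEN : AuthMode.KERBEROS;
  }
}
```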
   * Howl server code should change to use !UserGroupInformation.doAs() so that all operations
are performed as the real user. The real user's username would be needed to invoke doAs().
(Hopefully there is some way to get this from the Kerberos ticket with which the user authenticated.)

   * !HowlOutputFormat will need to get delegation tokens from the Howl server in checkOutputSpecs()
and store the token into the Hadoop credential store so that it can be passed to the tasks.
Specifically the !OutputCommitter task will use this token to authenticate to the Howl server
to invoke the publish_partition API call.
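The token hand-off in checkOutputSpecs() might look roughly like this. Java-flavored pseudocode: Credentials.addToken() is a real !HadoopS API, but the Howl client call and the "howl" alias are invented for illustration, and this is not runnable outside a cluster:

```java
// Sketch only: inside HowlOutputFormat.checkOutputSpecs() (illustrative).
Token<? extends TokenIdentifier> howlToken =
    howlClient.getDelegationToken(currentUser);   // hypothetical Howl thrift call
// Store it in the job's credential store under a well-known alias so the
// OutputCommitter task can find it and authenticate to the Howl server.
jobContext.getCredentials().addToken(new Text("howl"), howlToken);
```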
   * The JT should renew the Howl delegation token so it is kept valid for long-running jobs
(this might be difficult, since the JT will need to make a thrift call to renew the delegation token).
 For the short term we will simply set the timeout on these delegation tokens to be long.
 In the future the JT can handle renewing them.
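The renewal question is ultimately lifetime arithmetic: each renewal extends the expiry by one renew interval, but never past the token's maximum lifetime. A self-contained sketch (the 7-day cap matches the jt-renewed delegation tokens described in the background section; the class, method, and parameter names are made up for illustration):

```java
public class TokenLifetimeSketch {
  // 7-day cap, as for the jt-renewed delegation tokens described above.
  static final long MAX_LIFETIME_MS = 7L * 24 * 60 * 60 * 1000;

  // A renewal at renewTimeMs pushes the expiry out by one renew interval,
  // but never past issueTimeMs + MAX_LIFETIME_MS.
  static long nextExpiry(long issueTimeMs, long renewTimeMs, long renewIntervalMs) {
    long proposed = renewTimeMs + renewIntervalMs;
    return Math.min(proposed, issueTimeMs + MAX_LIFETIME_MS);
  }
}
```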


== Use cases with Howl ==

=== Howl client running DDL commands ===
 * A user does kinit to acquire Kerberos ticket - this gets him the TGT (ticket granting ticket)
 * The Howl client needs to acquire the service ticket to access the Howl service (This will
happen transparently through !HiveMetaStoreClient). This service ticket is used to authenticate
the user to the Howl server.
 * The Howl server, after authenticating the user, does a !UserGroupInformation.doAs() call
using the real user's username to perform the action requested.
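On the server side, that doAs() wrapping might look like the following. Java-flavored pseudocode: createProxyUser()/doAs() are real !HadoopS APIs, while `shortName`, `executeDdl`, and `request` are invented for illustration:

```java
// Sketch only: Howl server performing a DDL action as the authenticated user.
// "shortName" would come from the client's Kerberos principal on the
// authenticated connection.
UserGroupInformation clientUgi =
    UserGroupInformation.createProxyUser(shortName, UserGroupInformation.getLoginUser());
clientUgi.doAs(new PrivilegedExceptionAction<Void>() {
  public Void run() throws Exception {
    executeDdl(request);   // hypothetical: the requested metastore operation
    return null;
  }
});
```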

=== Pig script reading from and writing to tables in Howl ===
 * A user does kinit to acquire Kerberos ticket - this gets him the TGT (ticket granting ticket)
 * The !HowlInputFormat needs to acquire the service ticket to access the Howl service (this
will happen transparently through !HiveMetaStoreClient). This service ticket is used to authenticate
the user to the Howl server.
 * !HowlOutputFormat will need to get delegation tokens from the Howl server in checkOutputSpecs()
and store the token into the Hadoop credential store so that it can be passed to the tasks.
Specifically the !OutputCommitter task will use this token to authenticate to the Howl server
to invoke the publish_partition API call. 
  
=== Hive query reading from and writing to tables in Howl ===
 * A user does kinit to acquire Kerberos ticket - this gets him the TGT (ticket granting ticket)
 * The Hive client needs to acquire the service ticket to access the Howl service (This will
happen transparently through !HiveMetaStoreClient). This service ticket is used to authenticate
the user to the Howl server.

=== Java Map Reduce job reading from and writing to tables in Howl  ===
 * Same as Pig use case?

=== Oozie running a Pig script which reads from or writes to tables in Howl ===
'''How will Oozie know that the Pig script interacts with Howl - will need some change in
oozie to allow the work flow xml to indicate this?'''
 * Once oozie knows that the Pig script may read/write through Howl (maybe through some information
in the workflow xml), it should also authenticate to the Howl server and get the Howl delegation
token on behalf of the real user (in addition to the usual jt/nn delegation tokens it gets
by doing doAs() for creating the jobclient). The Howl delegation token should be added to
the launcher task so it is available in the map task launching the Pig script.
 * The !HowlInputFormat/!HowlOutputFormat code will use the delegation tokens already present
to authenticate to Howl server. 
 * The Howl delegation token should get sent to the actual map/reduce tasks of the Pig job
and also specifically to an !OutputCommitter task so that it can use it to publish partition
to the Howl server.
 
=== Oozie running a Java MR job which reads from or writes to tables in Howl ===
'''How will Oozie know that the Java MR job interacts with Howl - will need some change in
oozie to allow the work flow xml to indicate this?'''
 * Same as Pig?

=== Tools like DAQ invoke Howl API calls to register data ===
 * These services would simply use their Kerberos tickets to authenticate in the thrift API
calls. Apparently DAQ runs as a proxy user, and hence DAQ's use case would be similar to the
oozie one.
