From: Don Bosco Durai
Subject: Re: Two open issues on Kafka security
Date: Thu, 2 Oct 2014 10:54:27 -0700
To: dev@kafka.apache.org

I agree, username+IP would be sufficient.

I assume that when authentication is turned off or doesn't exist, but the authorization plugin is enabled, the username would be empty or passed as "nobody", but with a valid IP (if available).

> The name "context" is probably not the right one.
> The idea is to have an
> object into which we can easily add additional properties in the future
> to support additional authorization libraries without breaking backward
> compatibility with existing ones.

+1. Makes the design scalable.

Thanks

Bosco

>
>
> ----- Original message -----
> From: Jarek Jarcec Cecho
> To: dev@kafka.apache.org
> Subject: Re: Two open issues on Kafka security
> Date: Thu, 2 Oct 2014 08:33:45 -0700
>
> Thanks for getting back Jay!
>
> For the interface - Looking at Sentry and other authorization libraries
> in the Hadoop ecosystem, it seems that "username" is primarily used to
> perform authorization these days, and then IP for auditing. Hence I feel
> that username+IP would be sufficient, at least for now. However, I would
> assume that in the future we might need more than just those two, so
> what about defining the API in a way that we can easily extend in the
> future? Something like:
>
> authorize(Context, Entity, Action), where
>
> * Action - the action that the user is trying to perform (write to a topic,
> read from a topic, create a topic, ...)
> * Entity - the entity that the user is trying to perform that action on
> (topic, ...)
> * Context - a container with user/session information - user name, IP
> address, or perhaps the entire certificate, as was suggested earlier in the
> email thread.
>
> The name "context" is probably not the right one. The idea is to have an
> object into which we can easily add additional properties in the future
> to support additional authorization libraries without breaking backward
> compatibility with existing ones.
>
> The hierarchy is an interesting topic - I'm not familiar enough with Kafka
> internals, so I can't really speak to how much more complex it would
> be. I can speak about Sentry and the way we designed the security model for
> Hive and Search, where introducing the hierarchy wasn't complex at all
> and actually led to a cleaner model.
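[Editor's note: a minimal Java sketch of what the authorize(Context, Entity, Action) call proposed above might look like. All names here are illustrative assumptions for discussion, not any actual Kafka API.]

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed pluggable authorizer interface.
// Context is an extensible property bag so new fields (certificates,
// session data, ...) can be added later without breaking existing plugins.
interface Authorizer {
    boolean authorize(Context context, Entity entity, Action action);
}

enum Action { READ, WRITE, CREATE }

// Entity: the thing the action targets, e.g. a topic.
class Entity {
    final String type;
    final String name;
    Entity(String type, String name) { this.type = type; this.name = name; }
}

class Context {
    private final Map<String, Object> properties = new HashMap<>();

    Context with(String key, Object value) {
        properties.put(key, value);
        return this;
    }

    Object get(String key) {
        return properties.get(key);
    }
}

// A trivial example plugin: user jarcec may write to topic "logs".
class ExampleAuthorizer implements Authorizer {
    public boolean authorize(Context ctx, Entity entity, Action action) {
        return "jarcec".equals(ctx.get("username"))
                && "logs".equals(entity.name)
                && action == Action.WRITE;
    }
}
```

Because plugins read Context by key, adding, say, a raw certificate later would not break an existing plugin that only looks at "username" and "ip".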
> The biggest user-visible benefit
> is that you don't have to deal with special rules such as "give READ
> privilege to user jarcec on ALL topics". If you have a singleton parent
> entity (service, or whatever name seems more accurate), you can simply
> say that you have READ access on this root entity and all
> topics will then inherit it.
>
> Jarcec
>
> On Oct 1, 2014, at 9:33 PM, Jay Kreps wrote:
>
>> Hey Jarek,
>>
>> I agree with the importance of separating authentication and
>> authorization. The question is what concept of identity is sufficient
>> to pass through to the authorization layer? Just a "user name"? Or
>> perhaps you also need the ip the request originated from? Whatever
>> these would be, it would be nice to enumerate them so the authz portion
>> can be written in a way that ignores the authn part.
>>
>> So if no one else proposes anything different, maybe we can just say
>> user name + ip?
>>
>> With respect to hierarchy, it would be nice to have topic hierarchies,
>> but we don't have them now, so it seems overkill to try to think them
>> through wrt security now, right?
>>
>> -Jay
>>
>>
>>
>> On Wed, Oct 1, 2014 at 1:13 PM, Jarek Jarcec Cecho wrote:
>>> I'm following the security proposal wiki page [1] and this discussion, and I would like to jump in with a few points if I might :) Let me start by saying that I like the material and the discussion here - good work!
>>>
>>> I was part of the team who originally designed and worked on Sentry, and I wanted to share a few thoughts to see how they resonate with people. My first and probably biggest point would be to separate authorization and authentication as two separate systems. I believe that Jay has already stressed that in the email thread, but I wanted to reiterate that point.
>>> In my experience, users don't care that much about how a user has been authenticated as long as they trust that mechanism; what they care more about is that the authorization model is consistent and behaves the same way. E.g. if I configured that user jarcec can write into topic "logs", he should be able to do that no matter where the connection came from - whether he has been authenticated via Kerberos because he is directly exploring the data from his computer, via a delegation token because he is running MapReduce jobs calculating statistics, or via an SSL certificate because ... (well, I'm missing a good example here, but you're probably following my point).
>>>
>>> I've also noticed that we are planning to have no hierarchy in the authz object model per the wiki [1], with the reasoning that Kafka does not support topic hierarchies. I see that point, but at the same time it got me thinking - are we sure that Kafka will never have hierarchical topics? It seems like a nice feature that might be usable for some use cases and something that we might want to add in the future. But regardless of that, I would suggest introducing a hierarchy anyway, even if it were just two levels. In Sentry (for Hive) we've introduced the concept of a "Service", where all the databases are children of the service. In Kafka I would imagine that we would have a "service" with "topics" as its children. Having this makes it much easier to model general privileges where you need to grant access to all topics - you just grant access to the entire service and all topics "inherit" it.
>>>
>>> I'm wondering what other people's thoughts are?
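[Editor's note: the two-level service -> topic inheritance described above can be sketched as follows. This is a hypothetical illustration of the idea, assuming grants are stored per entity name and a topic falls back to its parent "service" grant; it is not Sentry's or Kafka's actual implementation.]

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of two-level privilege inheritance: a READ grant on the
// singleton "service" entity is inherited by every topic, replacing
// special rules like "give READ on ALL topics to user jarcec".
class HierarchicalAcl {
    // key: entity name ("service" or a topic name); value: users with READ
    private final Map<String, Set<String>> readGrants = new HashMap<>();

    void grantRead(String entity, String user) {
        readGrants.computeIfAbsent(entity, k -> new HashSet<>()).add(user);
    }

    boolean canRead(String topic, String user) {
        // direct grant on the topic, or one inherited from the parent service
        return readGrants.getOrDefault(topic, Collections.emptySet()).contains(user)
                || readGrants.getOrDefault("service", Collections.emptySet()).contains(user);
    }
}
```

With this shape, granting READ on "service" to a user makes canRead return true for every topic, while per-topic grants still work for users without the service-level privilege.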
>>>
>>> Jarcec
>>>
>>> Links:
>>> 1: https://cwiki.apache.org/confluence/display/KAFKA/Security
>>>
>>> On Oct 1, 2014, at 9:44 AM, Joe Stein wrote:
>>>
>>>> Hi Jonathan,
>>>>
>>>> "Hadoop delegation tokens to enable MapReduce, Samza, or other frameworks
>>>> running in the Hadoop environment to access Kafka"
>>>> https://cwiki.apache.org/confluence/display/KAFKA/Security is on the list,
>>>> yup!
>>>>
>>>> /*******************************************
>>>> Joe Stein
>>>> Founder, Principal Consultant
>>>> Big Data Open Source Security LLC
>>>> http://www.stealth.ly
>>>> Twitter: @allthingshadoop
>>>> ********************************************/
>>>>
>>>> On Wed, Oct 1, 2014 at 12:35 PM, Jonathan Creasy wrote:
>>>>
>>>>> This is not nearly as deep as the discussion so far, but I did want to
>>>>> throw this idea out there to make sure we've thought about it.
>>>>>
>>>>> The Kafka project should make sure that when deployed alongside a Hadoop
>>>>> cluster from any major distribution, it can tie seamlessly into the
>>>>> authentication and authorization used within that cluster - for example,
>>>>> Apache Sentry.
>>>>>
>>>>> This may present additional difficulties, which might mean a decision is made
>>>>> not to do that; alternatively, the Kerberos authentication and the
>>>>> authorization schemes we are already working on may be sufficient.
>>>>>
>>>>> I'm not sure that anything I've read so far in this discussion actually
>>>>> poses a problem, but I'm an Ops guy, and being able to more easily
>>>>> integrate more things makes my life better. :)
>>>>>
>>>>> -Jonathan
>>>>>
>>>>> On 9/30/14, 11:26 PM, "Joe Stein" wrote:
>>>>>
>>>>>> inline
>>>>>>
>>>>>> On Tue, Sep 30, 2014 at 11:58 PM, Jay Kreps wrote:
>>>>>>
>>>>>>> Hey Joe,
>>>>>>>
>>>>>>> For (1) what are you thinking for the PermissionManager api?
>>>>>>>
>>>>>>> The way I see it, the first question we have to answer is whether it
>>>>>>> is possible to make authentication and authorization independent. What
>>>>>>> I mean by that is whether I can write an authorization library that
>>>>>>> will work the same whether you authenticate with ssl or kerberos.
>>>>>>
>>>>>>
>>>>>> To me that is a requirement. We can't tie them together. We have to
>>>>>> provide the ability for authorization to work regardless of the
>>>>>> authentication. One *VERY* important use case is the level of trust in
>>>>>> the authentication from the authorization perspective, e.g. I authorize an
>>>>>> "identity" based on how it authenticated... Alice is able to view
>>>>>> topic X if Alice authenticated over Kerberos. Bob isn't allowed to view
>>>>>> topic X no matter what. Alice can authenticate over something other than
>>>>>> Kerberos (there are use cases for that), and in that case Alice wouldn't
>>>>>> see topic X. A concrete use case for this with Kafka would be a third-party
>>>>>> bank consuming data from a broker. The service provider would have some
>>>>>> local Kerberos auth for that bank to do backups, which would also have
>>>>>> access to other topics related to that bank's data... the bank itself, over
>>>>>> SSL, wants a stream of events (some specific topic), and that bank's
>>>>>> identity only sees that topic. It is important not to confuse identity,
>>>>>> authentication, and authorization.
>>>>>>
>>>>>>
>>>>>>> If
>>>>>>> so then we need to pick some subset of identity information that we
>>>>>>> can extract from both and have this constitute the identity we pass
>>>>>>> into the authorization interface. The original proposal had just the
>>>>>>> username/subject. But maybe we should add the ip address as well, as
>>>>>>> that is useful. What I would prefer not to do is add everything in the
>>>>>>> certificate.
>>>>>>> I think the assumption is that you are generating these
>>>>>>> certificates for Kafka, so you can put whatever identity info you want
>>>>>>> in the Subject Alternative Name. If that is true, then just using that
>>>>>>> should be okay, right?
>>>>>>>
>>>>>>
>>>>>> I think we should just push the byte[] and let the plugin deal with it.
>>>>>> So, if we have a certificate object, then pass that along with whatever
>>>>>> other metadata (e.g. the IP address of the client) we can. I don't think we
>>>>>> should do any parsing whatsoever; let the plugin deal with that. Any
>>>>>> parsing we do on the identity information for the "security object" forces
>>>>>> us into specific implementations, and I don't see any reason to do that...
>>>>>> If plug-ins want an "easier" time dealing with certs and parsing and blah
>>>>>> blah blah, then we can implement some way they can do this without much
>>>>>> fuss... we also need to make sure that the crypto library is pluggable too
>>>>>> (so we can expose an API for them to call) so that an HSM can be easily
>>>>>> dropped in without Kafka caring... so in the plugin we could provide an
>>>>>> identity.getAlternativeAttribute() and then that use case is solved (and
>>>>>> we can use Bouncy Castle or whatever to parse it for them to make it
>>>>>> easier)... and always give them raw bytes so they could do it themselves.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> -Jay
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Sep 30, 2014 at 4:09 PM, Joe Stein wrote:
>>>>>>>> 1) We need to support the most flexibility we can and make this
>>>>>>>> transparent to kafka (to use Gwen's term). Any specific implementation
>>>>>>>> is going to make it not work with some solution, stopping people from
>>>>>>>> using Kafka. That is a reality because everyone just does it slightly
>>>>>>>> differently enough.
>>>>>>>> If we have an "identity" byte structure (let's not use a string because
>>>>>>>> some security objects are bytes), this should just fall through to the
>>>>>>>> implementor. For certs this is the entire x509 object (not just the
>>>>>>>> certificate part, as it could contain an ASN.1 timestamp), and inside
>>>>>>>> you parse and do what you want with it.
>>>>>>>>
>>>>>>>> 2) While I think there are many benefits to just the handshake approach,
>>>>>>>> I don't think they outweigh the cons Jay expressed. a) We can't lead the
>>>>>>>> client libraries down a new path of interacting with Kafka. By
>>>>>>>> incrementally adding to the wire protocol we are directing a very clear
>>>>>>>> and expected approach. We already have issues with implementations even
>>>>>>>> with the wire protocol in place and are trying to improve that aspect of
>>>>>>>> the community as a whole. Let's not take a step backwards with this
>>>>>>>> there... also we need to not add more/different hoops to
>>>>>>>> debugging/administering/monitoring kafka, so taking advantage (as Jay
>>>>>>>> says) of built-in logging (etc.) is important... also for the client
>>>>>>>> library developers too :)
>>>>>>>>
>>>>>>>> On Tue, Sep 30, 2014 at 6:44 PM, Gwen Shapira wrote:
>>>>>>>>
>>>>>>>>> Re #1:
>>>>>>>>>
>>>>>>>>> Since auth_to_local is a Kerberos config, it's up to the admin to
>>>>>>>>> decide how he likes the user names and set it up properly (or leave it
>>>>>>>>> empty) and make sure the ACLs match. Simplified names may be needed if
>>>>>>>>> the authorization system integrates with LDAP to get groups, or
>>>>>>>>> something fancy like that.
>>>>>>>>>
>>>>>>>>> Note that it's completely transparent to Kafka - if the admin sets up
>>>>>>>>> auth_to_local rules, we simply see a different principal name. No need
>>>>>>>>> to do anything different.
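[Editor's note: for reference, the principal-to-username mapping Gwen describes is configured with auth_to_local rules in krb5.conf. A sketch, with an illustrative realm name - the exact rules depend on each site's naming conventions:]

```
[realms]
  ATHENA.MIT.EDU = {
    # Map e.g. jennifer@ATHENA.MIT.EDU -> jennifer by stripping the realm,
    # then fall back to the default mapping for anything that didn't match.
    auth_to_local = RULE:[1:$1@$0](.*@ATHENA\.MIT\.EDU)s/@.*//
    auth_to_local = DEFAULT
  }
```

Kafka would then see only the mapped short name as the principal, which is what makes the mapping transparent to the broker.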
>>>>>>>>>
>>>>>>>>> Gwen
>>>>>>>>>
>>>>>>>>> On Tue, Sep 30, 2014 at 3:31 PM, Jay Kreps wrote:
>>>>>>>>>> Current proposal is here:
>>>>>>>>>>
>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/Security
>>>>>>>>>>
>>>>>>>>>> Here are the two open questions I am aware of:
>>>>>>>>>>
>>>>>>>>>> 1. We want to separate authentication and authorization. This means
>>>>>>>>>> permissions will be assigned to some user-like subject/entity/person
>>>>>>>>>> string that is independent of the authentication mechanism. It sounds
>>>>>>>>>> like we agreed this could be done, and we had in mind some
>>>>>>>>>> krb-specific mangling that Gwen knew about, and I think the plan was
>>>>>>>>>> to use whatever the user chose to put in the Subject Alternative Name
>>>>>>>>>> of the cert for ssl. So in both cases these would translate to a
>>>>>>>>>> string denoting the entity whom we are granting permissions to in the
>>>>>>>>>> authorization layer. We should document these in the wiki to get
>>>>>>>>>> feedback on them.
>>>>>>>>>>
>>>>>>>>>> The Hadoop approach to extraction was something like this:
>>>>>>>>>>
>>>>>>>>>> http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.1/bk_installing_manually_book/content/rpm-chap14-2-3-1.html
>>>>>>>>>>
>>>>>>>>>> But actually I'm not sure if just using the full kerberos principal
>>>>>>>>>> is so bad? I.e. having the user be jennifer@athena.mit.edu versus
>>>>>>>>>> just jennifer. Where this would make a difference would be in a case
>>>>>>>>>> where you wanted the same user/entity to be able to authenticate via
>>>>>>>>>> different mechanisms (Hadoop auth, kerberos, ssl) and have a single
>>>>>>>>>> set of permissions.
>>>>>>>>>>
>>>>>>>>>> 2.
>>>>>>>>>> For SASL/Kerberos we need to figure out how the communication
>>>>>>>>>> between client and server will be handled to pass the
>>>>>>>>>> challenge/response byte[]. I.e.
>>>>>>>>>>
>>>>>>>>>> http://docs.oracle.com/javase/7/docs/api/javax/security/sasl/SaslClient.html#evaluateChallenge(byte[])
>>>>>>>>>> http://docs.oracle.com/javase/7/docs/api/javax/security/sasl/SaslServer.html#evaluateResponse(byte[])
>>>>>>>>>>
>>>>>>>>>> I am not a super expert in this area, but I will try to give my
>>>>>>>>>> understanding, and I'm sure someone can correct me if I am confused.
>>>>>>>>>>
>>>>>>>>>> Unlike SSL, the transmission of this is actually outside the scope
>>>>>>>>>> of SASL, so we have to specify it. Two proposals:
>>>>>>>>>>
>>>>>>>>>> Original Proposal: Add a new "authenticate" request/response
>>>>>>>>>>
>>>>>>>>>> The proposal in the original wiki was to add a new "authenticate"
>>>>>>>>>> request/response to pass this information. This matches what was
>>>>>>>>>> done in the kerberos implementation for zookeeper. The intention is
>>>>>>>>>> that the client would send this request immediately after
>>>>>>>>>> establishing a connection, in which case it acts much like a
>>>>>>>>>> "handshake"; however, there is no requirement that they do so.
>>>>>>>>>>
>>>>>>>>>> Whether the authentication happens via SSL or via Kerberos, the
>>>>>>>>>> effect will just be to set the username in their session. This will
>>>>>>>>>> default to the "anybody" user. So in the default non-secure case we
>>>>>>>>>> will just be defaulting "anybody" to have full permission. So to
>>>>>>>>>> answer the question about whether changing user is required or not,
>>>>>>>>>> I don't think it is, but I think we kind of get it for free in this
>>>>>>>>>> approach.
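[Editor's note: mechanically, both proposals reduce to shuttling opaque SASL tokens back and forth until the exchange completes. A minimal sketch of size-delimited token framing is below; the 4-byte length prefix is an assumption for illustration, since the actual framing was exactly the open question here.]

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch of size-delimited framing for SASL challenge/response tokens.
// Each token is written as a 4-byte big-endian length followed by the
// raw bytes; the framing shown is hypothetical, not Kafka's actual wire
// format.
class SaslFraming {
    static void writeToken(DataOutputStream out, byte[] token) throws IOException {
        out.writeInt(token.length);
        out.write(token);
        out.flush();
    }

    static byte[] readToken(DataInputStream in) throws IOException {
        int len = in.readInt();
        byte[] token = new byte[len];
        in.readFully(token);
        return token;
    }
}
```

In use, the client would feed each token read this way to SaslClient.evaluateChallenge() and write the result back, looping until SaslClient.isComplete() returns true, while the server mirrors the loop with SaslServer.evaluateResponse(). In the original proposal these token bytes would additionally be wrapped in the normal Kafka request header (client id, correlation id, etc.).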
>>>>>>>>>>
>>>>>>>>>> In this approach there is no particular need or advantage to having
>>>>>>>>>> a separate port for kerberos, I don't think.
>>>>>>>>>>
>>>>>>>>>> Alternate Proposal: Create a Handshake
>>>>>>>>>>
>>>>>>>>>> The alternative I think Michael was proposing was to create a
>>>>>>>>>> handshake that would happen at connection time on connections coming
>>>>>>>>>> in on the SASL port. This would require a separate port for SASL,
>>>>>>>>>> since otherwise you wouldn't be able to tell if the bytes you were
>>>>>>>>>> getting were for SASL or were the first request of an
>>>>>>>>>> unauthenticated connection.
>>>>>>>>>>
>>>>>>>>>> Michael, it would be good to work out the details of how this works.
>>>>>>>>>> Are we just sending size-delimited byte arrays back and forth until
>>>>>>>>>> the challenge/response terminates?
>>>>>>>>>>
>>>>>>>>>> My Take
>>>>>>>>>>
>>>>>>>>>> The pro I see for Michael's proposal is that it keeps the
>>>>>>>>>> authentication logic more localized in the socket server.
>>>>>>>>>>
>>>>>>>>>> I see two cons:
>>>>>>>>>> 1. Since the handshake won't go through the normal api layer, it
>>>>>>>>>> won't go through the normal logging (e.g. request log), jmx
>>>>>>>>>> monitoring, client trace token, correlation id, etc. that we get for
>>>>>>>>>> other requests. This could make operations a little confusing and
>>>>>>>>>> make debugging a little harder, since the client will be blocking on
>>>>>>>>>> network requests without the normal logging.
>>>>>>>>>> 2. This part of the protocol will be inconsistent with the rest of
>>>>>>>>>> the Kafka protocol, so it will be a little odd for client
>>>>>>>>>> implementors, as it will effectively be a request/response that they
>>>>>>>>>> will have to implement that is different from all the other
>>>>>>>>>> request/responses they implement.
>>>>>>>>>>
>>>>>>>>>> In practice these two alternatives are not very different, except
>>>>>>>>>> that in the original proposal the bytes you send are prefixed by the
>>>>>>>>>> normal request header fields such as the client id, correlation id,
>>>>>>>>>> etc. Overall I would prefer this, as I think it is a bit more
>>>>>>>>>> consistent from the client's point of view.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>> -Jay