From: Chris Douglas
Date: Wed, 3 Feb 2016 01:49:23 -0800
Subject: Re: Hadoop encryption module as Apache Chimera incubator project
To: hdfs-dev@hadoop.apache.org

On Wed, Feb 3, 2016 at 12:48 AM, Gangumalla, Uma wrote:
>> Standing in the point of a shared, fundamental piece of code like this,
>> I do think Apache Commons might be the best direction we can try as the
>> first effort. In this direction, we still need to work with the Apache
>> Commons community on buying in and accepting the proposal.
>
> Makes sense.

Makes sense how?

> For this we should define independent release cycles for the project,
> and it would just be placed under the Hadoop tree if we all conclude
> with this option at the end.

Yes.

> [Chris]
>> If Chimera is not successful as an independent project or stalls,
>> Hadoop and/or Spark and/or $project will have to reabsorb it as
>> maintainers.
>
> I am not so strong on this point. If we assume the project would be
> unsuccessful, it can be unsuccessful (less maintained) even under
> Hadoop. But if other projects depend on this piece, they would get less
> support. Right now we feel this piece of code is very important, and we
> expect it can be successful as an independent project, whether it lives
> outside Hadoop or inside. So I feel this point should not really sway
> the discussion.

Sure; code can idle anywhere, but that wasn't the point I was after. You
propose to extract code from Hadoop, but if Chimera fails, what recourse
do we have among the other projects taking a dependency on it? Splitting
off another project is feasible, but Chimera should be sustainable before
this PMC can divest itself of responsibility for security libraries.
That's a pretty low bar.
Bundling the library with the jar is helpful; I've used that approach
before. It should prefer (updated) libraries from the environment, if so
configured; otherwise it's a pain (or impossible) for ops to patch
security bugs. -C

>> -----Original Message-----
>> From: Colin P. McCabe [mailto:cmccabe@apache.org]
>> Sent: Wednesday, February 3, 2016 4:56 AM
>> To: hdfs-dev@hadoop.apache.org
>> Subject: Re: Hadoop encryption module as Apache Chimera incubator project
>>
>> It's great to see interest in improving this functionality. I think
>> Chimera could be successful as an Apache project. I don't have a strong
>> opinion one way or the other as to whether it belongs as part of Hadoop
>> or separate.
>>
>> I do think there will be some challenges splitting this functionality
>> out into a separate jar, because of the way our CLASSPATH works right
>> now. For example, say Hadoop depends on Chimera 1.2 and Spark depends
>> on Chimera 1.1. Now Spark jobs have two different versions fighting it
>> out on the classpath, similar to the situation with Guava and other
>> libraries. Perhaps if Chimera adopts a policy of strong backwards
>> compatibility we can just always use the latest jar, but it still seems
>> likely that there will be problems. There are various classpath
>> isolation ideas that could help here, but they are big projects in
>> their own right and we don't have a clear timeline for them. If this
>> does end up being a separate jar, we may need to shade it to avoid all
>> these issues.
>>
>> Bundling the JNI glue code in the jar itself is an interesting idea,
>> which we have talked about before for libhadoop.so. It doesn't really
>> have anything to do with the question of TLP vs. non-TLP, of course;
>> we could do that refactoring in Hadoop itself. The really complicated
>> part of bundling JNI code in a jar is that you need to create jars for
>> every element of the cross product of (JVM version, openssl version,
>> operating system).
>> For example, take the RHEL6 build for OpenJDK 7 using openssl 1.0.1e.
>> If you change any one thing (say, OpenJDK 7 to Oracle JDK 8), you
>> might need to rebuild. And using Ubuntu would certainly mean a rebuild.
>> And so forth. This clashes with Maven's philosophy of pulling prebuilt
>> jars from the internet.
>>
>> Kai Zheng's question about whether we would bundle openssl's libraries
>> is a good one. Given the high rate of new vulnerabilities discovered in
>> that library, bundling would require Hadoop users and vendors to update
>> much more frequently than Hadoop is traditionally updated. So we would
>> probably choose not to bundle openssl.
>>
>> best,
>> Colin
>>
>> On Tue, Feb 2, 2016 at 12:29 AM, Chris Douglas wrote:
>>> As a subproject of Hadoop, Chimera could maintain its own cadence.
>>> There's also no reason why it should maintain dependencies on other
>>> parts of Hadoop, if those are separable. How is this solution
>>> inadequate?
>>>
>>> If Chimera is not successful as an independent project or stalls,
>>> Hadoop and/or Spark and/or $project will have to reabsorb it as
>>> maintainers. Projects have high mortality in early life, and a fight
>>> over inheritance/maintenance is something we'd like to avoid. If, on
>>> the other hand, it develops enough of a community that it is obviously
>>> viable, then we can (and should) break it out as a TLP (as we have
>>> before). If other Apache projects take a dependency on Chimera, we're
>>> open to adding them to security@hadoop.
>>>
>>> Unlike Yetus, which was largely rewritten right before it was made
>>> into a TLP, security in Hadoop has a complicated pedigree. If Chimera
>>> eventually becomes a TLP, it seems fair to include those who work on
>>> it while it is a subproject.
>>> Declared upfront, that criterion is fairer than any post hoc
>>> justification, and will lead to a more accurate account of its
>>> community than a subset of the Hadoop PMC/committers that volunteer.
>>> -C
>>>
>>> On Mon, Feb 1, 2016 at 9:29 PM, Chen, Haifeng wrote:
>>>> Thanks to all the folks providing feedback and participating in the
>>>> discussion.
>>>>
>>>> @Owen, do you still have any concerns about going forward in the
>>>> direction of Apache Commons (or other options, e.g. a TLP)?
>>>>
>>>> Thanks,
>>>> Haifeng
>>>>
>>>> -----Original Message-----
>>>> From: Chen, Haifeng [mailto:haifeng.chen@intel.com]
>>>> Sent: Saturday, January 30, 2016 10:52 AM
>>>> To: hdfs-dev@hadoop.apache.org
>>>> Subject: RE: Hadoop encryption module as Apache Chimera incubator
>>>> project
>>>>
>>>>>> I believe encryption is becoming a core part of Hadoop. I think
>>>>>> that moving core components out of Hadoop is bad from a project
>>>>>> management perspective.
>>>>
>>>>> Although it's certainly true that encryption capabilities (in HDFS,
>>>>> YARN, etc.) are becoming core to Hadoop, I don't think that should
>>>>> really influence whether or not the non-Hadoop-specific encryption
>>>>> routines should be part of the Hadoop code base, or part of the code
>>>>> base of another project that Hadoop depends on. If Chimera had
>>>>> existed as a library hosted at the ASF when HDFS encryption was
>>>>> first developed, HDFS probably would have just added it as a
>>>>> dependency and been done with it. I don't think we would've
>>>>> copy/pasted the code for Chimera into the Hadoop code base.
>>>>
>>>> Agree with ATM. I also want to make an additional clarification. I
>>>> agree that the encryption capabilities are becoming core to Hadoop,
>>>> but this effort is about putting common, shared encryption routines,
>>>> such as the crypto stream implementations, into a scope where they
>>>> can be widely shared across the Apache ecosystem. It doesn't move
>>>> Hadoop encryption out of Hadoop (that is not possible).
>>>>
>>>> Agreed that making it a separate, independently released project
>>>> within Hadoop goes a step further than the existing approach and
>>>> solves some issues (such as the libhadoop.so problem). Frankly
>>>> speaking, though, I don't think it is the best option we can try. I
>>>> also expect that an independently released project within Hadoop core
>>>> would complicate the existing Hadoop release process.
>>>>
>>>> Thanks,
>>>> Haifeng
>>>>
>>>> -----Original Message-----
>>>> From: Aaron T. Myers [mailto:atm@cloudera.com]
>>>> Sent: Friday, January 29, 2016 9:51 AM
>>>> To: hdfs-dev@hadoop.apache.org
>>>> Subject: Re: Hadoop encryption module as Apache Chimera incubator
>>>> project
>>>>
>>>> On Wed, Jan 27, 2016 at 11:31 AM, Owen O'Malley wrote:
>>>>
>>>>> I believe encryption is becoming a core part of Hadoop. I think that
>>>>> moving core components out of Hadoop is bad from a project
>>>>> management perspective.
>>>>
>>>> Although it's certainly true that encryption capabilities (in HDFS,
>>>> YARN, etc.) are becoming core to Hadoop, I don't think that should
>>>> really influence whether or not the non-Hadoop-specific encryption
>>>> routines should be part of the Hadoop code base, or part of the code
>>>> base of another project that Hadoop depends on. If Chimera had
>>>> existed as a library hosted at the ASF when HDFS encryption was first
>>>> developed, HDFS probably would have just added it as a dependency and
>>>> been done with it. I don't think we would've copy/pasted the code for
>>>> Chimera into the Hadoop code base.
>>>>
>>>>> To put it another way, a bug in the encryption routines will likely
>>>>> become a security problem that security@hadoop needs to hear about.
>>>>> I don't think adding a separate project in the middle of that
>>>>> communication chain is a good idea. The same applies to data
>>>>> corruption problems, and so on...
>>>>
>>>> Isn't the same true of all the libraries that Hadoop currently
>>>> depends upon? If the commons-httpclient library (or commons-codec, or
>>>> commons-io, or guava, or...) has a security vulnerability, we need to
>>>> know about it so that we can update our dependency to a fixed
>>>> version. This case doesn't seem materially different from that.
>>>>
>>>>> > It may be good to keep it in a generalized place (as in the
>>>>> > discussion, we thought that place could be Apache Commons).
>>>>>
>>>>> Apache Commons is a collection of *Java* projects, so Chimera as a
>>>>> JNI-based library isn't a natural fit.
>>>>
>>>> Could very well be that Apache Commons's charter would preclude
>>>> Chimera. You probably know better than I do about that.
>>>>
>>>>> Furthermore, Apache Commons doesn't have its own security list, so
>>>>> problems will go to the generic security@apache.org.
>>>>
>>>> That seems easy enough to remedy if they wanted to, and besides, I'm
>>>> not sure why that would influence this discussion. In my experience,
>>>> projects that don't have a separate security@project.a.o mailing list
>>>> tend to just handle security issues on their private@project.a.o
>>>> mailing list, which seems fine to me.
>>>>
>>>>> Why do you think that Apache Commons is a better home than Hadoop?
>>>>
>>>> I'm certainly not wedded to Apache Commons; that just seemed like a
>>>> natural place to put it. Could be that a brand new TLP would make
>>>> more sense.
>>>>
>>>> I *do* think that if other non-Hadoop projects want to make use of
>>>> Chimera, which as I understand it is the goal that started this
>>>> thread, then Chimera should exist outside of Hadoop so that:
>>>>
>>>> a) Projects that have nothing to do with Hadoop can depend directly
>>>> on Chimera, which has nothing Hadoop-specific in it.
>>>>
>>>> b) The Hadoop project doesn't have to export/maintain/concern itself
>>>> with yet another publicly consumed interface.
>>>>
>>>> c) Chimera can have its own (presumably much faster) release cadence,
>>>> completely separate from Hadoop's.
>>>>
>>>> --
>>>> Aaron T. Myers
>>>> Software Engineer, Cloudera
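[Editor's illustration] The "bundle the JNI glue in the jar, but prefer an
updated library from the environment" behavior discussed in this thread can
be sketched roughly as below. This is not Chimera's or Hadoop's actual
code: the class name `NativeLoader`, the library name `libchimera`, and the
`/native/<os>-<arch>/` resource layout are all made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Locale;

/**
 * Hypothetical loader: try a library from the environment first (so ops can
 * patch security bugs independently), then fall back to a copy bundled in
 * the jar. Only Linux/macOS naming is sketched here.
 */
public final class NativeLoader {

    /** Resource path inside the jar for a platform, e.g. /native/linux-amd64/libchimera.so. */
    static String resourcePath(String osName, String arch) {
        String lower = osName.toLowerCase(Locale.ROOT);
        // Collapse "Mac OS X" to "darwin"; otherwise take the first word, e.g. "Linux" -> "linux".
        String os = lower.startsWith("mac") ? "darwin" : lower.split(" ")[0];
        String ext = os.equals("darwin") ? "dylib" : "so";
        return "/native/" + os + "-" + arch + "/libchimera." + ext;
    }

    /** Prefer the system library; extract the bundled copy only as a fallback. */
    public static void load() throws IOException {
        try {
            // Found on java.library.path: the environment-provided, patchable case.
            System.loadLibrary("chimera");
            return;
        } catch (UnsatisfiedLinkError ignored) {
            // Not on the system; fall through to the copy bundled in the jar.
        }
        String res = resourcePath(System.getProperty("os.name"),
                                  System.getProperty("os.arch"));
        try (InputStream in = NativeLoader.class.getResourceAsStream(res)) {
            if (in == null) {
                throw new UnsatisfiedLinkError("no bundled library at " + res);
            }
            // Copy the bundled library to a temp file so the OS loader can map it.
            Path tmp = Files.createTempFile("libchimera",
                    res.substring(res.lastIndexOf('.')));
            tmp.toFile().deleteOnExit();
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            System.load(tmp.toAbsolutePath().toString());
        }
    }
}
```

Note how the resource path bakes in exactly the (OS, architecture) cross
product Colin describes: each supported platform needs its own prebuilt
`.so`/`.dylib` inside the jar, which is why the number of build artifacts
multiplies.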