hadoop-common-dev mailing list archives

From "Gangumalla, Uma" <uma.ganguma...@intel.com>
Subject Re: Hadoop encryption module as Apache Chimera incubator project
Date Fri, 12 Feb 2016 02:19:26 GMT
Thanks Haifeng. I was just waiting to see if there were any more comments.
If there are no further objections, I will start a discussion thread in
Apache Commons in a day's time and will also cc hadoop common.

Regards,
Uma

On 2/11/16, 6:13 PM, "Chen, Haifeng" <haifeng.chen@intel.com> wrote:

>Thanks to all the folks participating in this discussion and providing
>valuable suggestions and options.
>
>I suggest we take it forward and make a proposal to the Apache Commons
>community.
>
>Thanks,
>Haifeng
>
>-----Original Message-----
>From: Chen, Haifeng [mailto:haifeng.chen@intel.com]
>Sent: Friday, February 5, 2016 10:06 AM
>To: hdfs-dev@hadoop.apache.org; common-dev@hadoop.apache.org
>Subject: RE: Hadoop encryption module as Apache Chimera incubator project
>
>> [Chris] Yes, but even if the artifact is widely consumed, as a TLP it
>>would need to sustain a community. If the scope is too narrow, then it
>>will quickly fall into maintenance mode, its contributors will move on,
>>and it will retire to the attic. Alone, I doubt its viability as a TLP.
>>So as a first option, donating only this code to Apache Commons would
>>accomplish some immediate goals in a sustainable forum.
>Totally agree. As a TLP it needs a well-defined scope and roadmap to
>sustain a development community.
>
>Thanks,
>Haifeng
>
>-----Original Message-----
>From: Chris Douglas [mailto:cdouglas@apache.org]
>Sent: Friday, February 5, 2016 6:28 AM
>To: common-dev@hadoop.apache.org
>Cc: hdfs-dev@hadoop.apache.org
>Subject: Re: Hadoop encryption module as Apache Chimera incubator project
>
>On Thu, Feb 4, 2016 at 12:06 PM, Gangumalla, Uma
><uma.gangumalla@intel.com> wrote:
>
>> [UMA] Ok. Great. You are right. I have cc'ed hadoop common. (You
>> mean to cc Apache Commons as well?)
>
>I meant, if you start a discussion with Apache Commons, please CC
>common-dev@hadoop to coordinate.
>
>> [UMA] Right now the encryption libraries are the only ones we plan to
>> include, as we see a lot of interest from other projects like Spark in
>> using them. The challenge I see in bringing a lot of other common code
>> into this project is that it would all have different requirements and
>> possibly different expected release timelines, etc. Some projects may
>> just want to use the encryption interfaces alone, but not everything.
>> As these are completely independent pieces of code, it may be better
>> to scope them out clearly.
>
>Yes, but even if the artifact is widely consumed, as a TLP it would need
>to sustain a community. If the scope is too narrow, then it will quickly
>fall into maintenance mode, its contributors will move on, and it will
>retire to the attic. Alone, I doubt its viability as a TLP. So as a first
>option, donating only this code to Apache Commons would accomplish some
>immediate goals in a sustainable forum.
>
>APR [1] has a similar scope. As a second option, that may also be a
>reasonable home, particularly if some of the native bits could integrate
>with APR.
>
>If the scope is broader, the effort could sustain prolonged development.
>The current code is developing a strategy for packing native libraries on
>multiple platforms, a capability that, say, the native compression codecs
>(AFAIK) still lack. While java.nio is improving, many projects would
>benefit from a better, native interface to the filesystem (e.g.,
>NativeIO). We could avoid duplicating effort and collaborate on a common
>library.
>
>As a third option, Hadoop already implements some useful native
>libraries, which is why a subproject might be a sound course. That would
>enable the subproject to coordinate with Hadoop on migrating its native
>functionality to a separable, reusable component, then move to a TLP when
>we can rely on it exclusively (if it has a well-defined, independent
>community). It could control its release cadence and limit its
>dependencies.
>
>Finally, this is beside the point if nobody is interested in doing the
>work on such a project. It's rude to pull code out of Hadoop and donate
>it to another project so Spark can avoid a dependency, but this instance
>seems reasonable to me. -C
>
>[1] https://apr.apache.org/
>
>> On 2/3/16, 6:46 PM, "Chen, Haifeng" <haifeng.chen@intel.com> wrote:
>>
>>>Thanks Chris.
>>>
>>>>> I went through the repository, and now understand the reasoning
>>>>>that would locate this code in Apache Commons. This isn't proposing
>>>>>to extract much of the implementation and it takes none of the
>>>>>integration. It's limited to interfaces to crypto libraries and
>>>>>streams/configuration.
>>>Exactly.
>>>
>>>>> Chimera would be a boutique TLP, unless we wanted to draw out more
>>>>>of the integration and tooling. Is that a goal you're interested in
>>>>>pursuing? There's a tension between keeping this focused and
>>>>>including enough functionality to make it viable as an independent
>>>>>component.
>>>The goal of Chimera is to provide useful, common, and optimized
>>>cryptographic functionality. I would prefer that it stay focused on
>>>this clear scope. Requirements from multiple domains would add
>>>challenges and uncertainty about where and how it should go, and thus
>>>more risk of stalling.
>>>
>>>>> If the encryption libraries are the only ones you're interested in
>>>>>pulling out, then Apache Commons does seem like a better target than
>>>>>a separate project.
>>>Yes. As mentioned above, the library will be positioned as a
>>>cryptographic library.
>>>
>>>
>>>Thanks,
>>>
>>>-----Original Message-----
>>>From: Chris Douglas [mailto:cdouglas@apache.org]
>>>Sent: Thursday, February 4, 2016 7:26 AM
>>>To: hdfs-dev@hadoop.apache.org
>>>Subject: Re: Hadoop encryption module as Apache Chimera incubator
>>>project
>>>
>>>I went through the repository, and now understand the reasoning that
>>>would locate this code in Apache Commons. This isn't proposing to
>>>extract much of the implementation and it takes none of the
>>>integration. It's limited to interfaces to crypto libraries and
>>>streams/configuration. It might be a reasonable fit for commons-codec,
>>>but that's a pretty sparse library and driving the release cadence
>>>might be more complicated. It'd be worth discussing on their lists
>>>(please also CC common-dev@).
>>>
>>>Chimera would be a boutique TLP, unless we wanted to draw out more of
>>>the integration and tooling. Is that a goal you're interested in
>>>pursuing?
>>>There's a tension between keeping this focused and including enough
>>>functionality to make it viable as an independent component. By way of
>>>example, Hadoop's common project requires too many dependencies and
>>>carries too much historical baggage for other projects to rely on.
>>>I agree with Colin/Steve: we don't want this to grow into another
>>>guava-like dependency that creates more work in conflicts than it
>>>saves in implementation...
>>>
>>>Would it make sense to also package some of the compression libraries,
>>>and maybe some of the text processing from MapReduce? Evolving some of
>>>this code to a common library with few/no dependencies would be
>>>generally useful. As a subproject, it could have a broader scope that
>>>could evolve into a viable TLP. If the encryption libraries are the
>>>only ones you're interested in pulling out, then Apache Commons does
>>>seem like a better target than a separate project. -C
>>>
>>>
>>>On Wed, Feb 3, 2016 at 1:49 AM, Chris Douglas <cdouglas@apache.org>
>>>wrote:
>>>> On Wed, Feb 3, 2016 at 12:48 AM, Gangumalla, Uma
>>>> <uma.gangumalla@intel.com> wrote:
>>>>>>From the standpoint of a shared, fundamental piece of code like
>>>>>>this, I do think Apache Commons might be the best direction to try
>>>>>>first. In this direction, we still need to work with the Apache
>>>>>>Commons community to get buy-in and acceptance of the proposal.
>>>>> Make sense.
>>>>
>>>> Makes sense how?
>>>>
>>>>> For this we should define independent release cycles for this
>>>>> project, and it would just be placed under the Hadoop tree if we
>>>>> all settle on this option in the end.
>>>>
>>>> Yes.
>>>>
>>>>> [Chris]
>>>>>>If Chimera is not successful as an independent project or stalls,
>>>>>>Hadoop and/or Spark and/or $project will have to reabsorb it as
>>>>>>maintainers.
>>>>>>
>>>>> I don't feel strongly about this point. If we assume the project
>>>>> would be unsuccessful, it can be unsuccessful (less maintained) even
>>>>> under Hadoop. But if other projects depend on this piece, then they
>>>>> would get less support. Of course, right now we feel this piece of
>>>>> code is very important, and we expect it can be successful as an
>>>>> independent project, irrespective of whether it is a separate
>>>>> project outside Hadoop or inside it. So I feel this point should not
>>>>> really influence the discussion.
>>>>
>>>> Sure; code can idle anywhere, but that wasn't the point I was after.
>>>> You propose to extract code from Hadoop, but if Chimera fails then
>>>> what recourse do we have among the other projects taking a
>>>> dependency on it? Splitting off another project is feasible, but
>>>> Chimera should be sustainable before this PMC can divest itself of
>>>> responsibility for security libraries. That's a pretty low bar.
>>>>
>>>> Bundling the library with the jar is helpful; I've used that before.
>>>> It should prefer (updated) libraries from the environment, if
>>>> configured. Otherwise it's a pain (or impossible) for ops to patch
>>>> security bugs. -C
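
The preference described above (use a library from the environment when one is configured, fall back to the copy bundled in the jar otherwise) could be sketched roughly as follows. This is only an illustration under assumptions: the class name, the `Source` enum, and the idea of a single operator-supplied path are hypothetical and are not taken from the Chimera or Hadoop code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: decide where to look for a native crypto library. */
final class NativeLibraryLocator {
    /** Possible origins of the native library. */
    enum Source { CONFIGURED_PATH, BUNDLED_IN_JAR }

    private NativeLibraryLocator() {}

    /**
     * Returns the sources to try, most preferred first.
     *
     * @param configuredPath   operator-supplied library location (an assumed
     *                         setting, not a real Hadoop/Chimera option); may be null
     * @param bundledAvailable whether the jar carries a build for this platform
     */
    static List<Source> searchOrder(String configuredPath, boolean bundledAvailable) {
        List<Source> order = new ArrayList<>();
        // A configured environment library wins, so ops can patch it in place.
        if (configuredPath != null && !configuredPath.trim().isEmpty()) {
            order.add(Source.CONFIGURED_PATH);
        }
        // The bundled copy is only a fallback.
        if (bundledAvailable) {
            order.add(Source.BUNDLED_IN_JAR);
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(searchOrder("/usr/lib64/libcrypto.so", true));
        System.out.println(searchOrder(null, true));
    }
}
```

With this ordering a patched system library takes effect without rebuilding or redeploying the jar, which is the operational concern raised in the message.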
>>>>
>>>>>>-----Original Message-----
>>>>>>From: Colin P. McCabe [mailto:cmccabe@apache.org]
>>>>>>Sent: Wednesday, February 3, 2016 4:56 AM
>>>>>>To: hdfs-dev@hadoop.apache.org
>>>>>>Subject: Re: Hadoop encryption module as Apache Chimera incubator
>>>>>>project
>>>>>>
>>>>>>It's great to see interest in improving this functionality.  I
>>>>>>think Chimera could be successful as an Apache project.  I don't
>>>>>>have a strong opinion one way or the other as to whether it belongs
>>>>>>as part of Hadoop or separate.
>>>>>>
>>>>>>I do think there will be some challenges splitting this
>>>>>>functionality out into a separate jar, because of the way our
>>>>>>CLASSPATH works right now.
>>>>>>For example, let's say that Hadoop depends on Chimera 1.2 and Spark
>>>>>>depends on Chimera 1.1.  Now Spark jobs have two different versions
>>>>>>fighting it out on the classpath, similar to the situation with
>>>>>>Guava and other libraries.  Perhaps if Chimera adopts a policy of
>>>>>>strong backwards compatibility, we can just always use the latest
>>>>>>jar, but it still seems likely that there will be problems.  There
>>>>>>are various classpath isolation ideas that could help here, but
>>>>>>they are big projects in their own right and we don't have a clear
>>>>>>timeline for them.  If this does end up being a separate jar, we
>>>>>>may need to shade it to avoid all these issues.
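
The version clash described above can be illustrated with a toy model of a flat classpath, where the first jar that provides a class wins. This is a simplified sketch, not how any particular Hadoop or Spark launcher builds its classpath; the jar contents and the class names `CryptoStream` and `NewCipher` are made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of a flat classpath: the first jar that provides a class wins. */
final class ClasspathConflictDemo {
    private ClasspathConflictDemo() {}

    /**
     * @param jarsInClasspathOrder jar name -> class names it contains, iterated
     *                             in classpath order (pass a LinkedHashMap)
     * @return class name -> jar the class would be loaded from
     */
    static Map<String, String> resolve(Map<String, String[]> jarsInClasspathOrder) {
        Map<String, String> loadedFrom = new LinkedHashMap<>();
        for (Map.Entry<String, String[]> jar : jarsInClasspathOrder.entrySet()) {
            for (String className : jar.getValue()) {
                // A later jar never overrides a class an earlier jar supplied.
                loadedFrom.putIfAbsent(className, jar.getKey());
            }
        }
        return loadedFrom;
    }

    public static void main(String[] args) {
        Map<String, String[]> classpath = new LinkedHashMap<>();
        // Hypothetical contents: 1.2 adds a class that 1.1 does not have.
        classpath.put("chimera-1.1.jar", new String[] {"CryptoStream"});
        classpath.put("chimera-1.2.jar", new String[] {"CryptoStream", "NewCipher"});
        System.out.println(resolve(classpath));
    }
}
```

In this model the job ends up with `CryptoStream` from 1.1 and `NewCipher` from 1.2, a mix that neither project tested. Shading avoids this by relocating one copy into a different package so the two versions no longer share class names.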
>>>>>>
>>>>>>Bundling the JNI glue code in the jar itself is an interesting
>>>>>>idea, which we have talked about before for libhadoop.so.  It
>>>>>>doesn't really have anything to do with the question of TLP vs.
>>>>>>non-TLP, of course.
>>>>>>We could do that refactoring in Hadoop itself.  The really
>>>>>>complicated part of bundling JNI code in a jar is that you need to
>>>>>>create jars for every cross product of (JVM version, openssl
>>>>>>version, operating system).
>>>>>>For example, you have the RHEL6 build for openJDK7 using openssl
>>>>>>1.0.1e.
>>>>>>If you change any one thing-- say, change openJDK7 to Oracle JDK8,
>>>>>>then you might need to rebuild.  And certainly using Ubuntu would
>>>>>>be a rebuild.  And so forth.  This kind of clashes with Maven's
>>>>>>philosophy of pulling prebuilt jars from the internet.
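
To make the size of that cross product concrete, here is a small sketch that enumerates the build variants. The naming scheme is hypothetical, and the values beyond those mentioned in the message (for example the second openssl version) are only illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Enumerates every (OS, JVM, openssl) combination that would need its own native build. */
final class NativeBuildMatrix {
    private NativeBuildMatrix() {}

    static List<String> variants(List<String> systems, List<String> jvms,
                                 List<String> opensslVersions) {
        List<String> result = new ArrayList<>();
        for (String os : systems) {
            for (String jvm : jvms) {
                for (String ssl : opensslVersions) {
                    // Hypothetical naming scheme for a per-platform artifact.
                    result.add(os + "-" + jvm + "-openssl" + ssl);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> all = variants(
                Arrays.asList("rhel6", "ubuntu"),
                Arrays.asList("openjdk7", "oraclejdk8"),
                Arrays.asList("1.0.1e", "1.0.2f"));
        System.out.println(all.size() + " native builds needed:");
        for (String variant : all) {
            System.out.println("  " + variant);
        }
    }
}
```

Even two choices per axis already yield eight artifacts, and each new OS, JVM, or openssl release multiplies the count.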
>>>>>>
>>>>>>Kai Zheng's question about whether we would bundle openSSL's
>>>>>>libraries is a good one.  Given the high rate of new
>>>>>>vulnerabilities discovered in that library, it seems like bundling
>>>>>>would require Hadoop users and vendors to update very frequently,
>>>>>>much more frequently than Hadoop is traditionally updated.  So
>>>>>>probably we would not choose to bundle openssl.
>>>>>>
>>>>>>best,
>>>>>>Colin
>>>>>>
>>>>>>On Tue, Feb 2, 2016 at 12:29 AM, Chris Douglas
>>>>>><cdouglas@apache.org>
>>>>>>wrote:
>>>>>>> As a subproject of Hadoop, Chimera could maintain its own cadence.
>>>>>>> There's also no reason why it should maintain dependencies on
>>>>>>> other parts of Hadoop, if those are separable. How is this
>>>>>>> solution inadequate?
>>>>>>>
>>>>>>> If Chimera is not successful as an independent project or stalls,
>>>>>>> Hadoop and/or Spark and/or $project will have to reabsorb it as
>>>>>>> maintainers. Projects have high mortality in early life, and a
>>>>>>> fight over inheritance/maintenance is something we'd like to avoid.
>>>>>>> If, on the other hand, it develops enough of a community where it
>>>>>>> is obviously viable, then we can (and should) break it out as a
>>>>>>> TLP (as we have before). If other Apache projects take a
>>>>>>> dependency on Chimera, we're open to adding them to
>>>>>>> security@hadoop.
>>>>>>>
>>>>>>> Unlike Yetus, which was largely rewritten right before it was
>>>>>>> made into a TLP, security in Hadoop has a complicated pedigree.
>>>>>>> If Chimera eventually becomes a TLP, it seems fair to include
>>>>>>> those who work on it while it is a subproject. Declared upfront,
>>>>>>> that criterion is fairer than any post hoc justification, and
>>>>>>> will lead to a more accurate account of its community than a
>>>>>>> subset of the Hadoop PMC/committers that volunteer. -C
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Feb 1, 2016 at 9:29 PM, Chen, Haifeng
>>>>>>><haifeng.chen@intel.com>
>>>>>>>wrote:
>>>>>>>> Thanks to all the folks providing feedback and participating in
>>>>>>>> the discussions.
>>>>>>>>
>>>>>>>> @Owen, do you still have any concerns about going forward in the
>>>>>>>> direction of Apache Commons (or other options, TLP)?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Haifeng
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Chen, Haifeng [mailto:haifeng.chen@intel.com]
>>>>>>>> Sent: Saturday, January 30, 2016 10:52 AM
>>>>>>>> To: hdfs-dev@hadoop.apache.org
>>>>>>>> Subject: RE: Hadoop encryption module as Apache Chimera
>>>>>>>> incubator project
>>>>>>>>
>>>>>>>>>> I believe encryption is becoming a core part of Hadoop. I
>>>>>>>>>> think that moving core components out of Hadoop is bad from a
>>>>>>>>>> project management perspective.
>>>>>>>>
>>>>>>>>> Although it's certainly true that encryption capabilities (in
>>>>>>>>> HDFS, YARN, etc.) are becoming core to Hadoop, I don't think
>>>>>>>>> that should really influence whether or not the
>>>>>>>>> non-Hadoop-specific encryption routines should be part of the
>>>>>>>>> Hadoop code base, or part of the code base of another project
>>>>>>>>> that Hadoop depends on.
>>>>>>>>> If Chimera had existed as a library hosted at ASF when HDFS
>>>>>>>>> encryption was first developed, HDFS probably would have just
>>>>>>>>> added that as a dependency and been done with it. I don't think
>>>>>>>>> we would've copy/pasted the code for Chimera into the Hadoop
>>>>>>>>> code base.
>>>>>>>>
>>>>>>>> Agree with ATM. I also want to make an additional clarification.
>>>>>>>> I agree that encryption capabilities are becoming core to Hadoop.
>>>>>>>> This effort, however, is to put common and shared encryption
>>>>>>>> routines, such as crypto stream implementations, into a scope
>>>>>>>> where they can be widely shared across the Apache ecosystem. This
>>>>>>>> doesn't move Hadoop encryption out of Hadoop (that is not
>>>>>>>> possible).
>>>>>>>>
>>>>>>>> I agree that making it a separate, independently released project
>>>>>>>> in Hadoop goes a step further than the existing approach and
>>>>>>>> solves some issues (such as the libhadoop.so problem). Frankly
>>>>>>>> speaking, I think it is not the best option we can try. I also
>>>>>>>> expect that an independently released project within Hadoop core
>>>>>>>> will complicate Hadoop's existing release model.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Haifeng
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Aaron T. Myers [mailto:atm@cloudera.com]
>>>>>>>> Sent: Friday, January 29, 2016 9:51 AM
>>>>>>>> To: hdfs-dev@hadoop.apache.org
>>>>>>>> Subject: Re: Hadoop encryption module as Apache Chimera
>>>>>>>> incubator project
>>>>>>>>
>>>>>>>> On Wed, Jan 27, 2016 at 11:31 AM, Owen O'Malley
>>>>>>>><omalley@apache.org>
>>>>>>>>wrote:
>>>>>>>>
>>>>>>>>> I believe encryption is becoming a core part of Hadoop. I think
>>>>>>>>> that moving core components out of Hadoop is bad from a project
>>>>>>>>> management perspective.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Although it's certainly true that encryption capabilities (in
>>>>>>>> HDFS, YARN, etc.) are becoming core to Hadoop, I don't think that
>>>>>>>> should really influence whether or not the non-Hadoop-specific
>>>>>>>> encryption routines should be part of the Hadoop code base, or
>>>>>>>> part of the code base of another project that Hadoop depends on.
>>>>>>>> If Chimera had existed as a library hosted at ASF when HDFS
>>>>>>>> encryption was first developed, HDFS probably would have just
>>>>>>>> added that as a dependency and been done with it. I don't think
>>>>>>>> we would've copy/pasted the code for Chimera into the Hadoop code
>>>>>>>> base.
>>>>>>>>
>>>>>>>>
>>>>>>>>> To put it another way, a bug in the encryption routines will
>>>>>>>>> likely become a security problem that security@hadoop needs to
>>>>>>>>> hear about. I don't think adding a separate project in the
>>>>>>>>> middle of that communication chain is a good idea. The same
>>>>>>>>> applies to data corruption problems, and so on...
>>>>>>>>>
>>>>>>>>
>>>>>>>> Isn't the same true of all the libraries that Hadoop currently
>>>>>>>> depends upon? If the commons-httpclient library (or
>>>>>>>> commons-codec, or commons-io, or guava, or...) has a security
>>>>>>>> vulnerability, we need to know about it so that we can update our
>>>>>>>> dependency to a fixed version.
>>>>>>>> This case doesn't seem materially different than that.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> > It may be good to keep it in a generalized place (as in the
>>>>>>>>> > discussion, we thought that place could be Apache Commons).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Apache Commons is a collection of *Java* projects, so Chimera
>>>>>>>>> as a JNI-based library isn't a natural fit.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Could very well be that Apache Commons's charter would preclude
>>>>>>>>Chimera.
>>>>>>>> You probably know better than I do about that.
>>>>>>>>
>>>>>>>>
>>>>>>>>> Furthermore, Apache Commons doesn't have its own security list
>>>>>>>>> so problems will go to the generic security@apache.org.
>>>>>>>>>
>>>>>>>>
>>>>>>>> That seems easy enough to remedy, if they wanted to, and besides
>>>>>>>> I'm not sure why that would influence this discussion. In my
>>>>>>>> experience projects that don't have a separate
>>>>>>>> security@project.a.o mailing list tend to just handle security
>>>>>>>> issues on their private@project.a.o mailing list, which seems
>>>>>>>> fine to me.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Why do you think that Apache Commons is a better home than
>>>>>>>>> Hadoop?
>>>>>>>>>
>>>>>>>>
>>>>>>>> I'm certainly not at all wedded to Apache Commons, that just
>>>>>>>> seemed like a natural place to put it to me. Could be that a
>>>>>>>> brand new TLP might make more sense.
>>>>>>>>
>>>>>>>> I *do* think that if other non-Hadoop projects want to make use
>>>>>>>> of Chimera, which as I understand it is the goal which started
>>>>>>>> this thread, then Chimera should exist outside of Hadoop so that:
>>>>>>>>
>>>>>>>> a) Projects that have nothing to do with Hadoop can just depend
>>>>>>>> directly on Chimera, which has nothing Hadoop-specific in there.
>>>>>>>>
>>>>>>>> b) The Hadoop project doesn't have to export/maintain/concern
>>>>>>>>itself with yet another publicly-consumed interface.
>>>>>>>>
>>>>>>>> c) Chimera can have its own (presumably much faster) release
>>>>>>>>cadence completely separate from Hadoop.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Aaron T. Myers
>>>>>>>> Software Engineer, Cloudera
>>>>>
>>
