From: "Zheng, Kai"
To: hdfs-dev@hadoop.apache.org
Subject: RE: Hadoop encryption module as Apache Chimera incubator project
Date: Thu, 21 Jan 2016 07:16:30 +0000

Thanks Uma.

I have a question, by the way. It's not about the Chimera project itself, but about the mentioned advantage 1 and the libhadoop.so installation problem. I have copied the relevant text below for convenience.
>>1. As Chimera embeds the native library in the jar (similar to Snappy-java), it solves the current issue in Hadoop that an HDFS client has to depend on libhadoop.so if it needs to read an encryption zone in HDFS. This means an HDFS client may have to depend on a Hadoop installation on the local machine. For example, HBase depends on the HDFS client jar rather than a Hadoop installation and thus has no access to libhadoop.so, so HBase cannot use an encryption zone without errors.

I believe Haifeng had mentioned this problem in a call when we discussed the erasure coding work, but only now do I understand what the problem is and how Chimera or Snappy-java solves it. It looks like there can be thin clients that don't rely on a Hadoop installation, so no libhadoop.so is available on the client host. The approach mentioned here is to bundle the library file (*.so) into a jar and dynamically extract it at load time. When no library file is contained in the jar, it falls back to the normal case, loading it from an installation. It's smart and nice! My question is: could we consider adopting this approach for the libhadoop.so library itself? It might be worth discussing because we're bundling more and more into the library (we recently put Intel ISA-L support into it), and such things may be desired by exactly such clients. It may also help development, because sometimes unit tests that involve native code fail with errors complaining that libhadoop.so cannot be found.

Thanks.

Regards,
Kai

-----Original Message-----
From: Gangumalla, Uma [mailto:uma.gangumalla@intel.com]
Sent: Thursday, January 21, 2016 11:20 AM
To: hdfs-dev@hadoop.apache.org
Subject: Re: Hadoop encryption module as Apache Chimera incubator project

Hi All,

Thanks Andrew, ATM, Yi, Kai, Larry. Thanks Haifeng for clarifying the release stuff. Please find my responses below.
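The jar-embedding approach Kai asks about above (as used by Snappy-java) can be sketched roughly as follows. This is a minimal illustration, not Chimera's or snappy-java's actual loader; the class name and resource path here are hypothetical:

```java
// Sketch of Snappy-java-style native loading: first look for a .so bundled
// in the jar's resources, extract it to a temp file and load it; when no
// bundled copy exists, fall back to the normal java.library.path lookup
// (i.e. loading from a local installation).
// NativeLoader and the "/native/..." resource path are hypothetical names.
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public final class NativeLoader {
    private NativeLoader() {}

    public static void load(String libName) {
        String resource = "/native/lib" + libName + ".so";
        try (InputStream in = NativeLoader.class.getResourceAsStream(resource)) {
            if (in != null) {
                // Bundled copy found: extract to a temp file and load it.
                Path tmp = Files.createTempFile("lib" + libName, ".so");
                tmp.toFile().deleteOnExit();
                Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
                System.load(tmp.toAbsolutePath().toString());
                return;
            }
        } catch (Exception e) {
            // Extraction failed; fall through to the normal lookup below.
        }
        // Normal case: load from an installation via java.library.path.
        System.loadLibrary(libName);
    }
}
```

With this shape, a thin client that ships only the jar still gets the native code, while a full installation keeps working unchanged.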
Andrew wrote:
If it becomes part of Apache Commons, could we make Chimera a separate JAR? We have real difficulties bumping dependency versions right now, so ideally we don't need to bump our existing Commons dependencies to use Chimera.

[UMA] Yes, we plan to make it a separate jar.

Andrew wrote:
With this refactoring, do we have confidence that we can get our desired changes merged and released in a timely fashion? e.g. if we find another bug like HADOOP-11343, we'll first need to get the fix into Chimera, have a new Chimera release, then bump Hadoop's Chimera dependency. This also relates to the previous point; it's easier to do this dependency bump if Chimera is a separate JAR.

[UMA] Yes, and the main target users for this project right now are Hadoop and Spark, so Hadoop requirements would be the priority tasks for it.

ATM wrote:
Uma, would you be up for approaching the Apache Commons folks saying that you'd like to contribute Chimera? I'd recommend saying that Hadoop and Spark are both on board to depend on this.

[UMA] Yes, will do that.

Kai wrote:
Just a question. Does becoming a separate jar/module in Apache Commons mean that Chimera or the module can be released separately and in a timely manner, not coupled with other modules in the project for release? Thanks.

[Haifeng] From the Apache Commons project web site (https://commons.apache.org/), we see there is already a long list of components in its Apache Commons Proper list. Each component has its own release version and date. To join and become one of that list is the target.

Larry wrote:
If what we are looking for is some level of autonomy, then it would need to be a module with its own release train - or at least be able to have one.

[UMA] Yes. Agree.

Kai wrote:
So far I saw it's mainly about AES-256. I suggest the scope be expanded a little bit, perhaps to a dedicated high-performance encryption library; then we would have quite a lot to contribute to it, like other ciphers, MACs, PRNGs and so on.
Then both Hadoop and Spark can benefit from it.

[UMA] Yes. Once development starts as a separate project, it is free to evolve and provide more improvements to support a wider customer/user space for encryption, based on demand. Haifeng, would you add some points here?

Regards,
Uma

On 1/20/16, 4:31 PM, "Andrew Wang" wrote:

>Thanks Uma for putting together this proposal. Overall sounds good to
>me, +1 for these improvements. A few comments/questions:
>
>* If it becomes part of Apache Commons, could we make Chimera a
>separate JAR? We have real difficulties bumping dependency versions
>right now, so ideally we don't need to bump our existing Commons
>dependencies to use Chimera.
>* With this refactoring, do we have confidence that we can get our
>desired changes merged and released in a timely fashion? e.g. if we
>find another bug like HADOOP-11343, we'll first need to get the fix
>into Chimera, have a new Chimera release, then bump Hadoop's Chimera
>dependency. This also relates to the previous point; it's easier to do
>this dependency bump if Chimera is a separate JAR.
>
>Best,
>Andrew
>
>On Mon, Jan 18, 2016 at 11:46 PM, Gangumalla, Uma
>wrote:
>
>> Hi Devs,
>>
>> Some of our Hadoop developers are working with the Spark community to
>>implement shuffle encryption. While implementing it, they realized
>>that some/most of the Hadoop encryption code and their implementation
>>code would have to be duplicated. This led to the idea of creating a
>>separate library, named Chimera
>>(https://github.com/intel-hadoop/chimera). It is an optimized
>>cryptographic library. It provides a Java API at both the cipher level
>>and the Java stream level to help developers implement high-performance
>>AES encryption/decryption with minimum code and effort.
Chimera was
>>originally based on Hadoop crypto code but has been improved and
>>generalized a lot to support a wider scope of data encryption needs for
>>more components in the community.
>>
>> So, now the team is thinking of making this library an open source
>>project via Apache incubation. The proposal is for Chimera to join
>>Apache as an incubating project, or Apache Commons, to facilitate its
>>adoption.
>>
>> In general this will bring the following advantages:
>> 1. As Chimera embeds the native library in the jar (similar to
>>Snappy-java), it solves the current issue in Hadoop that an HDFS client
>>has to depend on libhadoop.so if it needs to read an encryption zone in
>>HDFS. This means an HDFS client may have to depend on a Hadoop
>>installation on the local machine. For example, HBase depends on the
>>HDFS client jar rather than a Hadoop installation and thus has no
>>access to libhadoop.so, so HBase cannot use an encryption zone without
>>errors.
>> 2. Apache Spark shuffle and spill encryption could be another example
>>where we can use Chimera. We see that the stream encryption for Spark
>>shuffle and spill doesn't require a stream cipher like AES/CTR,
>>although the code shares the common characteristics of a stream-style
>>API. We also see the need for an optimized cipher for non-stream-style
>>use cases such as network encryption (e.g. RPC). These improvements can
>>actually be shared by more projects in need.
>> 3. Simplified code in Hadoop by using a dedicated library, which also
>>drives more improvements. For example, currently the Hadoop crypto code
>>API is entirely based on AES/CTR, although it has cipher suite
>>configurations. AES/CTR is suitable for HDFS data encryption at rest,
>>but it doesn't necessarily need to be AES/CTR for all cases, such as
>>data transfer encryption and intermediate file encryption.
>>
>> So, we wanted to check with the Hadoop community about this proposal.
>>Please provide your feedback on it.
>>
>> Regards,
>> Uma
>>
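The stream-level AES/CTR encryption discussed throughout this thread can be illustrated with the stock JCE API alone. The sketch below uses only the JDK's javax.crypto classes, not Chimera's actual API, to show the shape of a stream-style encrypt/decrypt round trip (AES-128 is used for brevity; AES-256 needs a 32-byte key and an unlimited-strength policy on older JDKs):

```java
// Round-trips plaintext through AES/CTR cipher streams, the way a
// shuffle/spill writer and reader would wrap their underlying streams.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public final class CtrStreamDemo {
    public static byte[] roundTrip(byte[] plaintext) throws Exception {
        byte[] key = new byte[16];  // AES-128 key; AES-256 would use 32 bytes
        byte[] iv = new byte[16];   // CTR initial counter block
        SecureRandom rng = new SecureRandom();
        rng.nextBytes(key);
        rng.nextBytes(iv);
        SecretKeySpec keySpec = new SecretKeySpec(key, "AES");

        // Encrypt through a stream, as a writer would.
        Cipher enc = Cipher.getInstance("AES/CTR/NoPadding");
        enc.init(Cipher.ENCRYPT_MODE, keySpec, new IvParameterSpec(iv));
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (CipherOutputStream out = new CipherOutputStream(sink, enc)) {
            out.write(plaintext);
        }

        // Decrypt the ciphertext back through a stream, as a reader would.
        Cipher dec = Cipher.getInstance("AES/CTR/NoPadding");
        dec.init(Cipher.DECRYPT_MODE, keySpec, new IvParameterSpec(iv));
        ByteArrayInputStream src = new ByteArrayInputStream(sink.toByteArray());
        ByteArrayOutputStream restored = new ByteArrayOutputStream();
        try (CipherInputStream in = new CipherInputStream(src, dec)) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                restored.write(buf, 0, n);
            }
        }
        return restored.toByteArray();
    }
}
```

Because CTR mode is a stream cipher mode with no padding, ciphertext length equals plaintext length, which is part of why it suits HDFS at-rest encryption with random reads; as point 3 above notes, other use cases need not be tied to it.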