From: "Dey, Avik" <avik.dey@intel.com>
To: general@hadoop.apache.org
CC: common-dev@hadoop.apache.org, mapreduce-dev@hadoop.apache.org, yarn-dev@hadoop.apache.org, dev@hbase.apache.org
Subject: ANNOUNCEMENT: Project Rhino: Enhanced Data Protection for the Apache Hadoop Ecosystem
Date: Mon, 25 Feb 2013 23:46:45 +0000
Project Rhino

As the Apache Hadoop ecosystem extends into new markets and sees new use cases with security and compliance challenges, the benefits of processing sensitive and legally protected data with Hadoop must be coupled with protection for private information that limits performance impact. Project Rhino is our open source effort to enhance the existing data protection capabilities of the Hadoop ecosystem to address these challenges, and to contribute the code back to Apache.

The core of the Apache Hadoop ecosystem as it is commonly understood is:

- Core: A set of shared libraries
- HDFS: The Hadoop filesystem
- MapReduce: Parallel computation framework
- ZooKeeper: Configuration management and coordination
- HBase: Column-oriented database on HDFS
- Hive: Data warehouse on HDFS with SQL-like access
- Pig: Higher-level programming language for Hadoop computations
- Oozie: Orchestration and workflow management
- Mahout: A library of machine learning and data mining algorithms
- Flume: Collection and import of log and event data
- Sqoop: Imports data from relational databases

These components are all separate projects, so cross-cutting concerns like authN, authZ, a consistent security policy framework, a consistent authorization model, and audit coverage are loosely coordinated. Some security features expected by our customers, such as encryption, are simply missing. Our aim is to take a full-stack view and work with the individual projects toward consistent concepts and capabilities, filling gaps as we go.

Our initial goals are:

1) Framework support for encryption and key management

There is currently no framework support for encryption or key management. We will add this support into Hadoop Core and integrate it across the ecosystem.

2) A common authorization framework for the Hadoop ecosystem

Each component currently has its own authorization engine.
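To make the idea of a shared engine concrete, here is a toy sketch of what a common authorization interface with a default user/group ACL engine could look like. This is purely our own illustration of the concept; the interface and class names are assumptions, not an existing Hadoop API:

```java
import java.util.*;

// Hypothetical sketch of a reusable authorization framework: one Authorizer
// interface that each component could plug into, with a default engine backed
// by simple user/group ACLs (the common starting point noted above). All
// names here are illustrative, not part of any Hadoop project.
interface Authorizer {
    boolean authorize(String user, Set<String> groups, String resource, String action);
}

class SimpleAclAuthorizer implements Authorizer {
    // resource -> principals allowed access; "group:" prefix marks a group.
    // A real engine would also key on the action; this sketch ignores it.
    private final Map<String, Set<String>> acls = new HashMap<>();

    void allow(String resource, String principal) {
        acls.computeIfAbsent(resource, r -> new HashSet<>()).add(principal);
    }

    @Override
    public boolean authorize(String user, Set<String> groups, String resource, String action) {
        Set<String> allowed = acls.getOrDefault(resource, Collections.emptySet());
        if (allowed.contains(user)) {
            return true;
        }
        for (String g : groups) {
            if (allowed.contains("group:" + g)) {
                return true;
            }
        }
        return false;
    }
}
```

A component-specific engine (say, HBase's) could then be modified to implement the same interface, or simply delegate to the default one.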
We will abstract the common functions into a reusable authorization framework with a consistent interface. Where appropriate we will either modify an existing engine to work within this framework, or we will plug in a common default engine. Therefore we must also normalize how security policy is expressed and applied by each component. Core, HDFS, ZooKeeper, and HBase currently support simple access control lists (ACLs) composed of users and groups. We see this as a good starting point. Where necessary we will modify components so they each offer equivalent functionality, and build support into others.

3) Token based authentication and single sign on

Core, HDFS, ZooKeeper, and HBase currently support Kerberos authentication at the RPC layer, via SASL. However, this does not provide valuable attributes such as group membership, classification level, organizational identity, or support for user defined attributes. Hadoop components must interrogate external resources to discover these attributes, which is problematic at scale. There is also no consistent delegation model. HDFS has a simple delegation capability, and only Oozie can take limited advantage of it. We will implement a common token based authentication framework to decouple internal user and service authentication from the external mechanisms used to support it (like Kerberos).

4) Extend HBase support for ACLs to the cell level

Currently HBase supports setting access controls at the table or column family level. However, many use cases would benefit from the additional capability to do this on a per-cell basis. In fact, for many users dealing with sensitive information the ability to do this is crucial.

5) Improve audit logging

Audit messages from the various Hadoop components do not use a unified or even consistent format. This makes analysis of logs for verifying compliance or taking corrective action difficult.
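As one illustration of what normalization could mean here, inconsistent entries might be reduced to a common key=value record, similar in shape to HDFS's existing audit lines. This is a hypothetical sketch; the class name and field names are our own assumptions:

```java
import java.util.*;

// Hypothetical sketch: reduce a key=value audit line (similar in shape to
// HDFS audit log entries such as "ugi=... ip=... cmd=... src=...") to a
// common record, so entries from different components could be compared or
// transformed to other formats. Illustrative only, not an existing tool.
class AuditRecordParser {
    static Map<String, String> parse(String line) {
        Map<String, String> record = new LinkedHashMap<>();
        for (String field : line.trim().split("\\s+")) {
            int eq = field.indexOf('=');
            if (eq > 0) {
                record.put(field.substring(0, eq), field.substring(eq + 1));
            }
        }
        return record;
    }
}
```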
We will build a common audit logging facility as part of the common authorization framework work. We will also build a set of common audit log processing tools for transforming logs to different industry standard formats, for supporting compliance verification, and for triggering responses to policy violations.

Current JIRAs:

As part of this ongoing effort we are contributing our work to date against the JIRAs listed below. As you may appreciate, the goals for Project Rhino cover a number of different Apache projects; the scope of work is significant and likely to only increase as we get additional community input. We also appreciate that there may be others in the Apache community who are working on some of this or are interested in contributing to it. If so, we look forward to partnering with you in Apache to accelerate this effort so the Apache community can see the benefits from our collective efforts sooner. You can also find a more detailed version of this announcement at Project Rhino. Please feel free to reach out to us by commenting on the JIRAs below:

HBASE-6222: Add per-KeyValue Security
HADOOP-9331: Hadoop crypto codec framework and crypto codec implementations, and related sub-tasks
MAPREDUCE-5025: Key Distribution and Management for supporting crypto codec in Map Reduce, and related JIRAs
HBASE-7544: Transparent table/CF encryption
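For readers curious what the per-cell access control of goal (4) might mean in practice, here is a toy model: each cell coordinate can carry its own reader set, checked before falling back to the table-level ACL. This is our own conceptual illustration, not the API proposed in HBASE-6222:

```java
import java.util.*;

// Toy illustration of per-cell ACLs in the spirit of goal (4). Not the HBase
// API: class, methods, and override-vs-fallback semantics are all assumptions
// made for this sketch.
class CellAclStore {
    private final Map<String, Set<String>> cellAcls = new HashMap<>();
    private final Set<String> tableReaders = new HashSet<>();

    void allowTable(String user) {
        tableReaders.add(user);
    }

    void allowCell(String row, String family, String qualifier, String user) {
        cellAcls.computeIfAbsent(row + "/" + family + ":" + qualifier,
                                 k -> new HashSet<>()).add(user);
    }

    boolean canRead(String user, String row, String family, String qualifier) {
        Set<String> acl = cellAcls.get(row + "/" + family + ":" + qualifier);
        if (acl != null) {
            return acl.contains(user); // a cell-level ACL overrides in this sketch
        }
        return tableReaders.contains(user); // otherwise fall back to the table ACL
    }
}
```

In this model, a table-wide reader can see ordinary cells but is shut out of individual sensitive cells that carry their own ACL, which is the per-cell capability sensitive-data users ask for.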