From: Alan Gates
To: general@incubator.apache.org
Subject: [PROPOSAL] Propose Howl as an Apache Incubator project
Date: Thu, 10 Feb 2011 12:37:07 -0800

I would like to propose Howl as an Apache Incubator project. Howl is a table and storage management service for data created using Apache Hadoop. The proposal is on the Incubator wiki at http://wiki.apache.org/incubator/HowlProposal and is pasted below.

Thanks.

Alan.

== Abstract ==

Howl is a table and storage management service for data created using Apache Hadoop.

== Proposal ==

The vision of Howl is to provide table management and storage management layers for Apache Hadoop. This includes:

 * Providing a shared schema and data type mechanism.
 * Providing a table abstraction so that users need not be concerned with where or how their data is stored.
 * Providing interoperability across data processing tools such as Pig, Map Reduce, Streaming, and Hive.

== Background ==

Data processors using Apache Hadoop have a common need for table management services. The goal of a table management service is to track data that exists in a Hadoop grid and present that data to users in a tabular format. Such a table management service needs to provide a single input and output format to users so that individual users need not be concerned with the storage formats that are chosen for particular data sets. As part of having a single format, the data will need to be described by one type of schema and have a single datatype system.

Additionally, users should be free to choose the best tools for their use cases. The Hadoop project includes Map Reduce, Streaming, Pig, and Hive, and additional tools exist such as Cascading.
Each of these tools has users who prefer it, and there are use cases best addressed by each of these tools. Two users on the same grid who need to share data should not be constrained to use the same tool but rather should be free to choose the best tool for their use case. A table management service that presents data in the same way to all of the tools can alleviate this problem by providing interfaces to each of the data processing tools.

There are also a few other features a table management service should provide, such as notification of when data arrives.

A couple of developers at Yahoo! started the project. It is based on the Hive !MetaStore component. There is a good amount of interest in such a service from Yahoo!, Facebook, !LinkedIn, and others. We are therefore proposing to place Howl in the Apache Incubator and to build an open source community around it.

== Rationale ==

There is a strong need for a table management service, especially for large grids with petabytes of data, and where the data volume is increasing by the day. Hadoop users need to find data to read and have a place to store their data. Currently users must understand the location of data to read, the storage format, compression techniques used, etc. To write data they need to understand where on HDFS their data belongs, the best compression format to use, how their data should be serialized, etc.

Most users do not want to be concerned with these issues. They want these managed for them.

Being an Apache open source project will benefit Howl by giving it access to the large community that already uses Hadoop and the products built around it (such as Pig and Hive). Users of the Hadoop ecosystem can influence Howl's roadmap and contribute to it.
Looked at another way, we believe having Howl as part of the Hadoop ecosystem will also greatly benefit the current Hadoop/Pig/Hive community.

== Current Status ==

=== Meritocracy ===

Our intent with this incubator proposal is to start building a diverse developer community around Howl following the Apache meritocracy model. We have wanted to make the project open source and encourage contributors from multiple organizations from the start. We plan to provide plenty of support to new developers and to quickly recruit those who make solid contributions to committer status.

=== Community ===

Howl is currently being used by developers at Yahoo!, and !LinkedIn and Facebook have expressed interest. Yahoo! also plans to deploy the current version of Howl in production soon. We hope to extend the user and developer base further in the future. The current developers and users are all interested in building a solid open source community around Howl.

To work towards an open source community, we have started using the !GitHub issue tracker and mailing lists at Yahoo! for development discussions within our group.

=== Core Developers ===

Howl is currently being developed by four engineers from Yahoo! - Devaraj Das, Ashutosh Chauhan, Sushanth Sowmyan, and Mac Yang. All four have deep expertise in Hadoop and the Hadoop ecosystem in general.

=== Alignment ===

The ASF is a natural host for Howl given that it is already the home of Hadoop, Pig, HBase, Cassandra, and other emerging cloud software projects. Howl was designed to support Hadoop from the beginning in order to solve data management challenges in Hadoop clusters. Howl complements the existing Apache cloud computing projects by providing a unified way to manage data.
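
As a purely illustrative aside, the table abstraction described in the Background section can be sketched in a few lines of Python. Everything here (the `TableRegistry` and `TableInfo` names, the `web_logs` table, its path and its columns) is hypothetical and is not Howl's API or data model; the sketch only shows the idea that each tool asks for a table by name and receives the same location, storage format, and schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableInfo:
    """Hypothetical metadata record; not Howl's actual data model."""
    location: str        # where the data lives, e.g. a path on HDFS
    storage_format: str  # e.g. "rcfile" or "text"
    schema: tuple        # (column name, type name) pairs


class TableRegistry:
    """Toy in-memory stand-in for a shared table management service."""

    def __init__(self):
        self._tables = {}

    def create_table(self, name, info):
        # Refuse to silently overwrite an existing table definition.
        if name in self._tables:
            raise ValueError(f"table {name!r} already exists")
        self._tables[name] = info

    def describe(self, name):
        # Every caller resolves a table by name and gets the same answer.
        return self._tables[name]


registry = TableRegistry()
registry.create_table(
    "web_logs",
    TableInfo(
        location="/data/web_logs",  # hypothetical path
        storage_format="rcfile",
        schema=(("user_id", "string"), ("ts", "bigint")),
    ),
)

# Two different tools (imagine a Pig-style loader and a Hive-style planner)
# ask by table name only; neither hard-codes the path or the file format.
pig_view = registry.describe("web_logs")
hive_view = registry.describe("web_logs")
```

Because the storage format is looked up rather than hard-coded by each user, it could in principle change without the callers noticing, which is the kind of independence the proposal is after.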

== Known Risks ==

=== Orphaned Products ===

The core developers plan to work full time on the project. There is very little risk of Howl getting orphaned since large companies like Yahoo! are planning to deploy this in their production Hadoop clusters. We believe we can build an active developer community around Howl (companies like Facebook and !LinkedIn have also expressed interest).

=== Inexperience with Open Source ===

All of the core developers are active users and followers of open source. Devaraj Das is an Apache Hadoop committer and Apache Hadoop PMC member, and has experience with the Apache infrastructure and development process. Ashutosh Chauhan is an Apache Pig committer and Apache Pig PMC member. Sushanth Sowmyan and Mac Yang made contributions to the Apache Hive and the Apache Chukwa projects.

=== Homogeneous Developers ===

The current core developers are all from Yahoo! However, we hope to establish a developer community that includes contributors from several corporations, and we are starting to work towards this with Facebook and !LinkedIn.

=== Reliance on Salaried Developers ===

Currently, the developers are paid to do work on Howl. However, once the project has a community built around it, we expect to get committers and developers from outside the current core developers. Companies like Yahoo! are invested in Howl being a solution to the data management problem in Hadoop clusters, and that is not likely to change.

=== Relationships with Other Apache Products ===

Howl is going to be used by users of Hadoop, Pig, and Hive. See section Initial Source below for more information about Howl's relationship to Hive.

=== An Excessive Fascination with the Apache Brand ===

While we respect the reputation of the Apache brand and have no doubts that it will attract contributors and users, our interest is primarily to give Howl a solid home as an open source project following an established development model. We have also given reasons in the Rationale and Alignment sections.

== Documentation ==

Information about Howl can be found at http://wiki.apache.org/pig/Howl. The following sources may be useful to start with:

 * The !GitHub site: https://github.com/yahoo/howl
 * The roadmap: http://wiki.apache.org/pig/HowlJournal

== Initial Source ==

Howl has been under development since summer 2010 by a team of Yahoo! engineers. It is currently hosted on !GitHub under an Apache license at https://github.com/yahoo/howl.

The initial development of Howl has consisted of:

 * maintaining a branch of the entire Hive codebase
 * getting Howl-related patches committed to Hive
 * developing Howl-specific plugins and wrappers to customize Hive behavior

At runtime, Howl executes Hive code for metastore and CLI+DDL, disabling anything related to Hadoop map/reduce execution. It also makes use of the RCFile storage format contained in Hive.

This approach was taken as a first step in order to validate the required functionality and get a production version working. However, in the long term, maintaining a clone of Hive is undesirable. One possible resolution is to factor the metastore+CLI+DDL components out of Hive and move them into Howl (making Hive dependent on Howl). Another possible resolution is to remove the copy of Hive from Howl and do the build/release engineering necessary to make Howl depend on Hive. As part of the incubation process, we plan to work towards resolution of these issues.

== External Dependencies ==

The dependencies all have Apache compatible licenses.

== Cryptography ==

Not applicable.

== Required Resources ==

=== Mailing Lists ===

 * howl-private for private PMC discussions (with moderated subscriptions)
 * howl-dev
 * howl-commits
 * howl-user

=== Subversion Directory ===

https://svn.apache.org/repos/asf/incubator/howl

=== Issue Tracking ===

JIRA Howl (HOWL)

=== Other Resources ===

The existing code already has unit tests, so we would like a Hudson instance to run them whenever a new patch is submitted. This can be added after project creation.

== Initial Committers ==

 * Devaraj Das
 * Ashutosh Chauhan
 * Sushanth Sowmyan
 * Mac Yang
 * Paul Yang
 * Alan Gates

A CLA is already on file for Sushanth.

== Affiliations ==

 * Devaraj Das (Yahoo!)
 * Ashutosh Chauhan (Yahoo!)
 * Sushanth Sowmyan (Yahoo!)
 * Mac Yang (Yahoo!)
 * Paul Yang (Facebook)
 * Alan Gates (Yahoo!)

== Sponsors ==

=== Champion ===

Owen O'Malley

=== Nominated Mentors ===

 * Olga Natkovich (Pig PMC member and Apache VP for Pig)
 * Alan Gates (Pig PMC member)
 * John Sichi (Hive PMC member)

=== Sponsoring Entity ===

We are requesting the Incubator to sponsor this project.