incubator-general mailing list archives

From Alan Gates <ga...@yahoo-inc.com>
Subject [PROPOSAL] Propose Howl as an Apache Incubator project
Date Thu, 10 Feb 2011 20:37:07 GMT
I would like to propose Howl as an Apache Incubator project.  Howl is
a table and storage management service for data created using Apache
Hadoop.  The proposal is on the Incubator wiki at
http://wiki.apache.org/incubator/HowlProposal and is pasted below.
Thanks.

Alan.

== Abstract ==
Howl is a table and storage management service for data created using  
Apache Hadoop.

== Proposal ==
The vision of Howl is to provide table management and storage  
management layers for Apache Hadoop.  This includes:
  * Providing a shared schema and data type mechanism.
  * Providing a table abstraction so that users need not be concerned  
with where or how their data is stored.
  * Providing interoperability across data processing tools such as  
Pig, Map Reduce, Streaming, and Hive.

== Background ==
Data processors using Apache Hadoop have a common need for table  
management services.  The goal of a table management service is to  
track data that exists in a Hadoop grid and present that data to users  
in a tabular format.  Such a table management service needs to provide  
a single input and output format to users so that individual users  
need not be concerned with the storage formats that are chosen for  
particular data sets.  As part of having a single format, the data  
will need to be described by one type of schema and have a single  
datatype system.

Additionally, users should be free to choose the best tools for their  
use cases.  The Hadoop project includes Map Reduce, Streaming, Pig,  
and Hive, and additional tools exist such as Cascading.  Each of these  
tools has users who prefer it, and there are use cases best addressed  
by each of these tools.  Two users on the same grid who need to share  
data should not be constrained to use the same tool but rather should  
be free to choose the best tool for their use case.  A table  
management service that presents data in the same way to all of the  
tools can alleviate this problem by providing interfaces to each of  
the data processing tools.

There are also a few other features a table management service should  
provide, such as notification of when data arrives.

A couple of developers at Yahoo! started the project. It is based on
the Hive !MetaStore component. There is a good amount of interest in
such a service from Yahoo!, Facebook, !LinkedIn, and others. We are
therefore proposing to place Howl in the Apache Incubator and to build
an open source community around it.


== Rationale ==
There is a strong need for a table management service, especially for  
large grids with petabytes of data, and where the data volume is  
increasing by the day. Hadoop users need to find data to read and have  
a place to store their data.  Currently users must understand the  
location of data to read, the storage format, compression techniques  
used, etc.  To write data they need to understand where on HDFS their  
data belongs, the best compression format to use, how their data  
should be serialized, etc.

Most users do not want to be concerned with these issues.  They want  
these managed for them.

Becoming an Apache open source project will benefit Howl by giving it
access to the large community that already uses Hadoop and the
products built around it (such as Pig and Hive). Users of the Hadoop
ecosystem can influence Howl’s roadmap and contribute to it.
Conversely, we believe that having Howl as part of the Hadoop
ecosystem will greatly benefit the current Hadoop/Pig/Hive community.

== Current Status ==
=== Meritocracy ===
Our intent with this incubator proposal is to start building a diverse  
developer community around Howl following the Apache meritocracy  
model. We have wanted to make the project open source and encourage  
contributors from multiple organizations from the start. We plan to  
provide plenty of support to new developers and to quickly recruit  
those who make solid contributions to committer status.

=== Community ===
Howl is currently used by developers at Yahoo!, and !LinkedIn and
Facebook have expressed interest. Yahoo! also plans to deploy the
current version of Howl in production soon. We hope to extend the user
and developer base further in the future. The current developers and
users are all interested in building a solid open source community
around Howl.

To work towards an open source community, we have started using the
!GitHub issue tracker and mailing lists at Yahoo! for development
discussions within our group.

=== Core Developers ===
Howl is currently being developed by four engineers from Yahoo! -  
Devaraj Das, Ashutosh Chauhan, Sushanth Sowmyan, and Mac Yang. All the  
engineers have deep expertise in Hadoop and the Hadoop Ecosystem in  
general.

=== Alignment ===
The ASF is a natural host for Howl given that it is already the home  
of Hadoop, Pig, HBase, Cassandra, and other emerging cloud software  
projects. Howl was designed to support Hadoop from the beginning in  
order to solve data management challenges in Hadoop clusters. Howl  
complements the existing Apache cloud computing projects by providing  
a unified way to manage data.

== Known Risks ==
=== Orphaned Products ===
The core developers plan to work full time on the project. There is  
very little risk of Howl getting orphaned since large companies like  
Yahoo! are planning to deploy this in their production Hadoop  
clusters. We believe we can build an active developer community around  
Howl (companies like Facebook and !LinkedIn have also expressed  
interest).

=== Inexperience with Open Source ===
All of the core developers are active users and followers of open  
source. Devaraj Das is an Apache Hadoop committer and Apache Hadoop  
PMC member, and has experience with the Apache infrastructure and  
development process. Ashutosh Chauhan is an Apache Pig committer and  
Apache Pig PMC member.  Sushanth Sowmyan and Mac Yang made  
contributions to the Apache Hive and the Apache Chukwa projects.

=== Homogeneous Developers ===
The current core developers are all from Yahoo! However, we hope to  
establish a developer community that includes contributors from  
several corporations, and we are starting to work towards this with  
Facebook and !LinkedIn.

=== Reliance on Salaried Developers ===
Currently, the developers are paid to work on Howl. However, once  
the project has a community built around it, we expect to get  
committers and developers from outside the current core developers.  
Companies like Yahoo! are invested in Howl being a solution to the  
data management problem in Hadoop clusters, and that is not likely to  
change.

=== Relationships with Other Apache Products ===
Howl is intended for users of Hadoop, Pig, and Hive. See the Initial
Source section below for more information about Howl's relationship
to Hive.

=== An Excessive Fascination with the Apache Brand ===
While we respect the reputation of the Apache brand and have no doubts  
that it will attract contributors and users, our interest is primarily  
to give Howl a solid home as an open source project following an  
established development model. We have also given reasons in the  
Rationale and Alignment sections.

== Documentation ==
Information about Howl can be found at http://wiki.apache.org/pig/Howl.
The following sources may be useful to start with:
  * The !GitHub site: https://github.com/yahoo/howl
  * The roadmap: http://wiki.apache.org/pig/HowlJournal

== Initial Source ==
Howl has been under development since summer 2010 by a team of
engineers at Yahoo!.  It is currently hosted on !GitHub under an
Apache license at https://github.com/yahoo/howl.

The initial development of Howl has consisted of:

  * maintaining a branch of the entire Hive codebase
  * getting Howl-related patches committed to Hive
  * developing Howl-specific plugins and wrappers to customize Hive  
behavior

At runtime, Howl executes Hive code for metastore and CLI+DDL,  
disabling anything related to Hadoop map/reduce execution.  It also  
makes use of the RCFile storage format contained in Hive.

This approach was taken as a first step in order to validate the  
required functionality and get a production version working.  However,  
in the long-term, maintaining a clone of Hive is undesirable.  One  
possible resolution is to factor the metastore+CLI+DDL components out  
of Hive and move them into Howl (making Hive dependent on Howl).   
Another possible resolution is to remove the copy of Hive from Howl  
and do the build/release engineering necessary to make Howl depend on  
Hive.  As part of the incubation process, we plan to work towards  
resolution of these issues.

== External Dependencies ==
The dependencies all have Apache compatible licenses.

== Cryptography ==
Not applicable.

== Required Resources ==
=== Mailing Lists ===
  * howl-private for private PMC discussions (with moderated  
subscriptions)
  * howl-dev
  * howl-commits
  * howl-user

=== Subversion Directory ===
https://svn.apache.org/repos/asf/incubator/howl

=== Issue Tracking ===
JIRA Howl (HOWL)

=== Other Resources ===
The existing code already has unit tests, so we would like a Hudson  
instance to run them whenever a new patch is submitted. This can be  
added after project creation.

== Initial Committers ==
  * Devaraj Das
  * Ashutosh Chauhan
  * Sushanth Sowmyan
  * Mac Yang
  * Paul Yang
  * Alan Gates
A CLA is already on file for Sushanth.

== Affiliations ==
  * Devaraj Das (Yahoo!)
  * Ashutosh Chauhan (Yahoo!)
  * Sushanth Sowmyan (Yahoo!)
  * Mac Yang (Yahoo!)
  * Paul Yang (Facebook)
  * Alan Gates (Yahoo!)

== Sponsors ==
=== Champion ===
Owen O’Malley

=== Nominated Mentors ===
  * Olga Natkovich (Pig PMC member and Apache VP for Pig)
  * Alan Gates (Pig PMC member)
  * John Sichi (Hive PMC member)

=== Sponsoring Entity ===
We are requesting the Incubator to sponsor this project.



