incubator-general mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hyunsik Choi <hyun...@apache.org>
Subject Re: [VOTE] Accept DataFu into the Incubator
Date Thu, 02 Jan 2014 07:14:53 GMT
It's really nice work.

+1 (non-binding)

On Thu, Jan 2, 2014 at 12:27 PM, Imesh Gunaratne <imesh@apache.org> wrote:
> +1 (non-binding)
>
>
> On Wed, Jan 1, 2014 at 2:09 AM, Jakob Homan <jghoman@gmail.com> wrote:
>
>> Incubator-
>>
>> Following the discussion earlier, I'm calling a vote to accept DataFu as a
>> new Incubator project.
>>
>> The proposal draft is available at:
>> https://wiki.apache.org/incubator/DataFuProposal, and is also included
>> below.
>>
>> Vote is open for at least 96h and closes at the earliest on 4 Jan 13:00
>> PDT.  I'm letting the vote run an extra day as we're in the holiday season.
>>
>> [ ] +1 accept DataFu in the Incubator
>> [ ] +/-0
>> [ ] -1 because...
>>
>> Here's my binding +1.
>> -Jakob
>>
>> -------------------------------
>> Abstract
>>
>> DataFu makes it easier to solve data problems using Hadoop and higher level
>> languages based on it.
>>
>> Proposal
>>
>> DataFu provides a collection of Hadoop MapReduce jobs and functions in
>> higher level languages based on it to perform data analysis. It provides
>> functions for common statistics tasks (e.g. quantiles, sampling), PageRank,
>> stream sessionization, and set and bag operations. DataFu also provides
>> Hadoop jobs for incremental data processing in MapReduce.
>>
>> Background
>>
>> DataFu began two years ago as set of UDFs developed internally at LinkedIn,
>> coming from our desire to solve common problems with reusable components.
>> Recognizing that the community could benefit from such a library, we added
>> documentation, an extensive suite of unit tests, and open sourced the code.
>> Since then there have been steady contributions to DataFu as we encountered
>> common problems not yet solved by it. Others outside LinkedIn have
>> contributed as well. More recently we recognized the challenges with
>> efficient incremental processing of data in Hadoop and have contributed a
>> set of Hadoop MapReduce jobs as a solution.
>>
>> DataFu began as a project at LinkedIn, but it has shown itself to be useful
>> to other organizations and developers as well as they have faced similar
>> problems. We would like to share DataFu with the ASF and begin developing a
>> community of developers and users within Apache.
>>
>> Rationale
>>
>> There is a strong need for well tested libraries that help developers solve
>> common data problems in Hadoop and higher level languages such as Pig,
>> Hive, Crunch, Scalding, etc.
>>
>> Current Status
>>
>> Meritocracy
>>
>> Our intent with this incubator proposal is to start building a diverse
>> developer community around DataFu following the Apache meritocracy model.
>> Since DataFu was initially open sourced in 2011, it has received
>> contributions from both within and outside LinkedIn. We plan to continue
>> support for new contributors and work with those who contribute
>> significantly to the project to make them committers.
>>
>> Community
>>
>> DataFu has been building a community of developers for two years. It began
>> with contributors from LinkedIn and has received contributions from
>> developers at Cloudera since very early on. It has been included included
>> in Cloudera’s Hadoop Distribution and Apache Bigtop. We hope to extend our
>> contributor base significantly and invite all those who are interested in
>> solving large-scale data processing problems to participate.
>>
>> Core Developers
>>
>> DataFu has a strong base of developers at LinkedIn. Matthew Hayes initiated
>> the project in 2011, and aside from continued contributions to DataFu has
>> also contributed the sub-project Hourglass for incremental MapReduce
>> processing. Separate from DataFu he has also open sourced the White
>> Elephant project. Sam Shah contributed a significant portion of the
>> original code and continues to contribute to the project. William Vaughan
>> has been contributing regularly to DataFu for the past two years. Evion Kim
>> has been contributing to DataFu for the past year. Xiangrui Meng recently
>> contributed implementations of scalable sampling algorithms based on
>> research from a paper he published. Chris Lloyd has provided some important
>> bug fixes and unit tests. Mitul Tiwari has also contributed to DataFu.
>> Mathieu Bastian has been developing MapReduce jobs that we hope to include
>> in DataFu. In addition he also leads the open source Gephi project.
>>
>> Alignment
>>
>> The ASF is the natural choice to host the DataFu project as its goal of
>> encouraging community-driven open-source projects fits with our vision for
>> DataFu. Additionally, other projects DataFu integrates with, such as Apache
>> Pig and Apache Hadoop, and in the future Apache Hive and Apache Crunch, are
>> hosted by the ASF and we will benefit and provide benefit by close
>> proximity to them.
>>
>> Known Risks
>>
>> Orphaned Products
>>
>> The core developers have been contributing to DataFu for the past two
>> years. There is very little risk of DataFu being abandoned given its
>> widespread use within LinkedIn.
>>
>> Inexperience with Open Source
>>
>> DataFu was started as an open source project in 2011 and has remained so
>> for two years. Matt initiated the project, and additionally is the creator
>> of the open source White Elephant project. He has also contributed patches
>> to Apache Pig. Most recently he has released Hourglass as a sub-project of
>> DataFu. Sam contributed much of the original code and continues to
>> contribute to the project. Will has been contributing to DataFu since it
>> was first open sourced. Evion has been contributing for the past year.
>> Mathieu leads the open source Gephi project. Jakob has been actively
>> involved with the ASF as a full-time Hadoop committer and PMC member.
>>
>> Homogeneous Developers
>>
>> The current core developers are all from LinkedIn. DataFu has also received
>> contributions from other corporations such as Cloudera. Two of these
>> developers are among the Initial Committers listed below. We hope to
>> establish a developer community that includes contributors from several
>> other corporations and we are actively encouraging new contributors via
>> presentations and blog posts.
>>
>> Reliance on Salaried Developers
>>
>> The current core developers are salaried employees of LinkedIn, however
>> they are not paid specifically to work on DataFu. Contributions to DataFu
>> arise from the developers solving problems they encounter in their various
>> projects. The purpose of DataFu is to share these solutions so that others
>> may benefit and build a community of developers striving to solve common
>> problems together. Furthermore, once the project has a community built
>> around it, we expect to get committers, developers and contributions from
>> outside the current core developers.
>>
>> Relationships with Other Apache Products
>>
>> DataFu is deeply integrated with Apache products. It began as a library of
>> user-defined functions for Apache Pig. It has grown to also include Hadoop
>> jobs for incremental data processing and in the future will include code
>> for other higher level languages built on top of Apache Hadoop.
>>
>> An Excessive Obsession with the Apache Brand
>>
>> While we respect the reputation of the Apache brand and have no doubts that
>> it will attract contributors and users, our interest is primarily to give
>> DataFu a solid home as an open source project following an established
>> development model.
>>
>> Documentation
>>
>> Information on DataFu can be found at:
>>
>> https://github.com/LinkedIn/DataFu/blob/master/README.md
>>
>> Initial Source
>>
>> The initial source is available at:
>>
>> https://github.com/LinkedIn/DataFu
>>
>> Source and Intellectual Property Submission Plan
>>
>>     The DataFu library source code, available on GitHub.
>>
>> External Dependencies
>>
>> The initial source has the following external dependencies that are either
>> included in the final DataFu library or required in order to use it:
>>
>>     fastutil (Apache 2.0)
>>     joda-time (Apache 2.0)
>>     commons-math (Apache 2.0)
>>     guava (Apache 2.0)
>>     stream (Apache 2.0)
>>     jsr-305 (BSD)
>>     log4j (Apache 2.0)
>>     json (The JSON License)
>>     avro (Apache 2.0)
>>
>> In addition, the following external libraries are used either in building,
>> developing, or testing the project:
>>
>>     pig (Apache 2.0)
>>     hadoop (Apache 2.0)
>>     jline (BSD)
>>     antlr (BSD)
>>     commons-io (Apache 2.0)
>>     testng (Apache 2.0)
>>     maven (Apache 2.0)
>>     jsr-311 (CDDL-1.0)
>>     slf4j (MIT)
>>     eclipse (Eclipse Public License 1.0)
>>     autojar (GPLv2)
>>     jarjar (Apache 2.0)
>>
>> Cryptography
>>
>> DataFu has user-defined functions that use MD5 and SHA provided by Java’s
>> java.security.MessageDigest.
>>
>> Required Resources
>>
>> Mailing Lists
>>
>> DataFu-private for private PMC discussions (with moderated subscriptions)
>> DataFu-dev DataFu-commits
>>
>> Subversion Directory
>>
>> Git is the preferred source control system: git://git.apache.org/DataFu
>>
>> Issue Tracking
>>
>> JIRA DataFu (DataFu)
>>
>> Other Resources
>>
>> The existing code already has unit tests, so we would like a Hudson
>> instance to run them whenever a new patch is submitted. This can be added
>> after project creation.
>>
>> Initial Committers
>>
>>     Matthew Hayes
>>     William Vaughan
>>     Evion Kim
>>     Sam Shah
>>     Xiangrui Meng
>>     Christopher Lloyd
>>     Mathieu Bastian
>>     Mitul Tiwari
>>     Josh Wills
>>     Jarek Jarcec Cecho
>>
>> Affiliations
>>
>>     Matthew Hayes (LinkedIn)
>>
>>     William Vaughan (LinkedIn)
>>
>>     Evion Kim (LinkedIn)
>>
>>     Sam Shah (LinkedIn)
>>
>>     Xiangrui Meng (LinkedIn)
>>
>>     Christopher Lloyd (LinkedIn)
>>
>>     Mathieu Bastian (LinkedIn)
>>
>>     Mitul Tiwari (LinkedIn)
>>     Josh Wills (Cloudera)
>>     Jarek Jarcec Cecho (Cloudera)
>>
>> Sponsors
>>
>> Champion
>>
>> Jakob Homan (Apache Member)
>>
>> Nominated Mentors
>>
>>     Ashutosh Chauhan <hashutosh at apache dot org>
>>
>>     Roman Shaposhnik <rvs at apache dot org>
>>
>>     Ted Dunning <tdunning at apache dot org>
>>
>> Sponsoring Entity
>>
>> We are requesting the Incubator to sponsor this project.
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org


Mime
View raw message