incubator-general mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrei Savu <savu.and...@gmail.com>
Subject Re: [VOTE] Accept DataFu into the Incubator
Date Sat, 04 Jan 2014 00:19:27 GMT
+1 (binding)

-- Andrei Savu


On Tue, Dec 31, 2013 at 12:39 PM, Jakob Homan <jghoman@gmail.com> wrote:

> Incubator-
>
> Following the discussion earlier, I'm calling a vote to accept DataFu as a
> new Incubator project.
>
> The proposal draft is available at:
> https://wiki.apache.org/incubator/DataFuProposal, and is also included
> below.
>
> Vote is open for at least 96h and closes at the earliest on 4 Jan 13:00
> PDT.  I'm letting the vote run an extra day as we're in the holiday season.
>
> [ ] +1 accept DataFu in the Incubator
> [ ] +/-0
> [ ] -1 because...
>
> Here's my binding +1.
> -Jakob
>
> -------------------------------
> Abstract
>
> DataFu makes it easier to solve data problems using Hadoop and higher level
> languages based on it.
>
> Proposal
>
> DataFu provides a collection of Hadoop MapReduce jobs and functions in
> higher level languages based on it to perform data analysis. It provides
> functions for common statistics tasks (e.g. quantiles, sampling), PageRank,
> stream sessionization, and set and bag operations. DataFu also provides
> Hadoop jobs for incremental data processing in MapReduce.
>
> Background
>
> DataFu began two years ago as set of UDFs developed internally at LinkedIn,
> coming from our desire to solve common problems with reusable components.
> Recognizing that the community could benefit from such a library, we added
> documentation, an extensive suite of unit tests, and open sourced the code.
> Since then there have been steady contributions to DataFu as we encountered
> common problems not yet solved by it. Others outside LinkedIn have
> contributed as well. More recently we recognized the challenges with
> efficient incremental processing of data in Hadoop and have contributed a
> set of Hadoop MapReduce jobs as a solution.
>
> DataFu began as a project at LinkedIn, but it has shown itself to be useful
> to other organizations and developers as well as they have faced similar
> problems. We would like to share DataFu with the ASF and begin developing a
> community of developers and users within Apache.
>
> Rationale
>
> There is a strong need for well tested libraries that help developers solve
> common data problems in Hadoop and higher level languages such as Pig,
> Hive, Crunch, Scalding, etc.
>
> Current Status
>
> Meritocracy
>
> Our intent with this incubator proposal is to start building a diverse
> developer community around DataFu following the Apache meritocracy model.
> Since DataFu was initially open sourced in 2011, it has received
> contributions from both within and outside LinkedIn. We plan to continue
> support for new contributors and work with those who contribute
> significantly to the project to make them committers.
>
> Community
>
> DataFu has been building a community of developers for two years. It began
> with contributors from LinkedIn and has received contributions from
> developers at Cloudera since very early on. It has been included included
> in Cloudera’s Hadoop Distribution and Apache Bigtop. We hope to extend our
> contributor base significantly and invite all those who are interested in
> solving large-scale data processing problems to participate.
>
> Core Developers
>
> DataFu has a strong base of developers at LinkedIn. Matthew Hayes initiated
> the project in 2011, and aside from continued contributions to DataFu has
> also contributed the sub-project Hourglass for incremental MapReduce
> processing. Separate from DataFu he has also open sourced the White
> Elephant project. Sam Shah contributed a significant portion of the
> original code and continues to contribute to the project. William Vaughan
> has been contributing regularly to DataFu for the past two years. Evion Kim
> has been contributing to DataFu for the past year. Xiangrui Meng recently
> contributed implementations of scalable sampling algorithms based on
> research from a paper he published. Chris Lloyd has provided some important
> bug fixes and unit tests. Mitul Tiwari has also contributed to DataFu.
> Mathieu Bastian has been developing MapReduce jobs that we hope to include
> in DataFu. In addition he also leads the open source Gephi project.
>
> Alignment
>
> The ASF is the natural choice to host the DataFu project as its goal of
> encouraging community-driven open-source projects fits with our vision for
> DataFu. Additionally, other projects DataFu integrates with, such as Apache
> Pig and Apache Hadoop, and in the future Apache Hive and Apache Crunch, are
> hosted by the ASF and we will benefit and provide benefit by close
> proximity to them.
>
> Known Risks
>
> Orphaned Products
>
> The core developers have been contributing to DataFu for the past two
> years. There is very little risk of DataFu being abandoned given its
> widespread use within LinkedIn.
>
> Inexperience with Open Source
>
> DataFu was started as an open source project in 2011 and has remained so
> for two years. Matt initiated the project, and additionally is the creator
> of the open source White Elephant project. He has also contributed patches
> to Apache Pig. Most recently he has released Hourglass as a sub-project of
> DataFu. Sam contributed much of the original code and continues to
> contribute to the project. Will has been contributing to DataFu since it
> was first open sourced. Evion has been contributing for the past year.
> Mathieu leads the open source Gephi project. Jakob has been actively
> involved with the ASF as a full-time Hadoop committer and PMC member.
>
> Homogeneous Developers
>
> The current core developers are all from LinkedIn. DataFu has also received
> contributions from other corporations such as Cloudera. Two of these
> developers are among the Initial Committers listed below. We hope to
> establish a developer community that includes contributors from several
> other corporations and we are actively encouraging new contributors via
> presentations and blog posts.
>
> Reliance on Salaried Developers
>
> The current core developers are salaried employees of LinkedIn, however
> they are not paid specifically to work on DataFu. Contributions to DataFu
> arise from the developers solving problems they encounter in their various
> projects. The purpose of DataFu is to share these solutions so that others
> may benefit and build a community of developers striving to solve common
> problems together. Furthermore, once the project has a community built
> around it, we expect to get committers, developers and contributions from
> outside the current core developers.
>
> Relationships with Other Apache Products
>
> DataFu is deeply integrated with Apache products. It began as a library of
> user-defined functions for Apache Pig. It has grown to also include Hadoop
> jobs for incremental data processing and in the future will include code
> for other higher level languages built on top of Apache Hadoop.
>
> An Excessive Obsession with the Apache Brand
>
> While we respect the reputation of the Apache brand and have no doubts that
> it will attract contributors and users, our interest is primarily to give
> DataFu a solid home as an open source project following an established
> development model.
>
> Documentation
>
> Information on DataFu can be found at:
>
> https://github.com/LinkedIn/DataFu/blob/master/README.md
>
> Initial Source
>
> The initial source is available at:
>
> https://github.com/LinkedIn/DataFu
>
> Source and Intellectual Property Submission Plan
>
>     The DataFu library source code, available on GitHub.
>
> External Dependencies
>
> The initial source has the following external dependencies that are either
> included in the final DataFu library or required in order to use it:
>
>     fastutil (Apache 2.0)
>     joda-time (Apache 2.0)
>     commons-math (Apache 2.0)
>     guava (Apache 2.0)
>     stream (Apache 2.0)
>     jsr-305 (BSD)
>     log4j (Apache 2.0)
>     json (The JSON License)
>     avro (Apache 2.0)
>
> In addition, the following external libraries are used either in building,
> developing, or testing the project:
>
>     pig (Apache 2.0)
>     hadoop (Apache 2.0)
>     jline (BSD)
>     antlr (BSD)
>     commons-io (Apache 2.0)
>     testng (Apache 2.0)
>     maven (Apache 2.0)
>     jsr-311 (CDDL-1.0)
>     slf4j (MIT)
>     eclipse (Eclipse Public License 1.0)
>     autojar (GPLv2)
>     jarjar (Apache 2.0)
>
> Cryptography
>
> DataFu has user-defined functions that use MD5 and SHA provided by Java’s
> java.security.MessageDigest.
>
> Required Resources
>
> Mailing Lists
>
> DataFu-private for private PMC discussions (with moderated subscriptions)
> DataFu-dev DataFu-commits
>
> Subversion Directory
>
> Git is the preferred source control system: git://git.apache.org/DataFu
>
> Issue Tracking
>
> JIRA DataFu (DataFu)
>
> Other Resources
>
> The existing code already has unit tests, so we would like a Hudson
> instance to run them whenever a new patch is submitted. This can be added
> after project creation.
>
> Initial Committers
>
>     Matthew Hayes
>     William Vaughan
>     Evion Kim
>     Sam Shah
>     Xiangrui Meng
>     Christopher Lloyd
>     Mathieu Bastian
>     Mitul Tiwari
>     Josh Wills
>     Jarek Jarcec Cecho
>
> Affiliations
>
>     Matthew Hayes (LinkedIn)
>
>     William Vaughan (LinkedIn)
>
>     Evion Kim (LinkedIn)
>
>     Sam Shah (LinkedIn)
>
>     Xiangrui Meng (LinkedIn)
>
>     Christopher Lloyd (LinkedIn)
>
>     Mathieu Bastian (LinkedIn)
>
>     Mitul Tiwari (LinkedIn)
>     Josh Wills (Cloudera)
>     Jarek Jarcec Cecho (Cloudera)
>
> Sponsors
>
> Champion
>
> Jakob Homan (Apache Member)
>
> Nominated Mentors
>
>     Ashutosh Chauhan <hashutosh at apache dot org>
>
>     Roman Shaposhnik <rvs at apache dot org>
>
>     Ted Dunning <tdunning at apache dot org>
>
> Sponsoring Entity
>
> We are requesting the Incubator to sponsor this project.
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message