incubator-general mailing list archives

From Atri Sharma <a...@apache.org>
Subject Re: [DISCUSS] MADlib Incubation Proposal
Date Thu, 03 Sep 2015 00:16:39 GMT
I am very happy to see this proposal.

I think the combination of HAWQ and MADlib makes it possible to have heavy
production-level analytics on top of Hadoop (which is fantastic!).

That said, given MADlib's flexibility, I feel it would be a great addition
to the Apache big data stack in general, and I am eagerly looking forward to
integration efforts with various existing Apache big data members.

Regards,

Atri
On 3 Sep 2015 02:07, "Roman Shaposhnik" <rvs@apache.org> wrote:

> Hi!
>
> on the heels of the HAWQ proposal, I'd like
> to follow with a discussion of accepting MADlib's
> community into the ASF Incubator:
>      https://wiki.apache.org/incubator/MADlibProposal
>
> There was an extensive discussion within the existing
> open source community and the overall consensus
> is extremely supportive of this proposal:
>     http://madlib.net/pipermail/user/2015-August/
>     http://madlib.net/pipermail/devel/2015-August/
>
> We've done quite a bit of outreach in order to identify
> all the folks who may be interested in joining the initial
> list of committers. The current proposal reflects that.
> Additionally, we hope that the ASF DISCUSS thread
> will help us in reaching out even further.
>
> Finally, while the 3 experienced mentors currently listed
> on the proposal seem like a reasonable number, we would
> love it if other folks from the IPMC could volunteer to help us on
> this journey.
>
> Thanks,
> Roman.
>
> == Abstract ==
> MADlib is an open-source library (licensed under 2-clause BSD license)
> for scalable in-database analytics. It provides data-parallel
> implementations of mathematical, statistical and machine learning
> methods for structured and unstructured data. The MADlib mission is to
> foster widespread development of scalable analytic skills, by
> harnessing efforts from commercial practice, academic research, and
> open source development.
>
> MADlib occupies a unique niche in the realm of data science and
> machine learning libraries since its SQL APIs can allow it to work on
> a wide range of data stores and SQL engines.
>
> == Proposal ==
> The current open source community behind MADlib feels that aligning
> itself with HAWQ's community, governance model, infrastructure and
> roadmap will allow the project to accelerate adoption and community
> growth. Given HAWQ's trajectory of entering Apache Software Foundation
> family as an Incubating project, we feel that the best course of
> action for MADlib is to follow a similar route.
>
> MADlib and HAWQ are complementary technologies in that MADlib
> in-database analytical functions can run within the HAWQ execution
> engine. (MADlib also runs on Greenplum Database and PostgreSQL today.)
> It is expected that contributors to MADlib will be cognizant of the
> HAWQ ASF project and may contribute to it as well.  In short,
> collaboration between the two communities will make both projects more
> vibrant and advance the respective technologies in potentially novel
> directions.
>
> Contributors may also look at the HAWQ project as a starting point for
> ports to other parallel database engines. This proposal highly
> encourages this type of work as it would help to further realize the
> original cross-platform goal of MADlib as envisioned by its
> originators.
>
> Thus, the goal of this proposal is to bring the existing MADlib open
> source community into ASF, change the project's governance model to
> the "Apache Way" and transition the project's codebase and
> infrastructure into ASF INFRA. The community has agreed to transfer
> the brand name "MADlib" to Apache Software Foundation as well.
>
> Pivotal Inc. on behalf of the MADlib open source community is
> submitting this proposal to transition source code and associated
> artifacts (documentation, web site content, wiki, etc.) to the Apache
> Software Foundation Incubator under the Apache License, Version 2.0
> and is asking the Incubator PMC to establish a MADlib incubating
> project.
>
> Currently MADlib uses a few category X licensed software tools during
> its build (mostly for generating documentation):
>    * doxypy 0.4.2 (GPL)
>    * doxygen 1.8.4 (GPL)
>    * TikZ-UML
>    * bison 2.4 (GPL, with an exception for generated output)
> We feel that this usage is compatible with an overall project licensed
> under the ALv2 and don't anticipate any changes.
> Our usage of the LGPL library cern_root-5.34 is expected to go away since
> the 2 cern modules used are being entirely rewritten in MADlib.
>
> Finally, MADlib's inclusion of an MPL-licensed library (eigen 3.2.2) in
> its binary artifact seems to be consistent with the ASF recommendation
> for managing "weak copyleft" dependencies.
>
>
> == Background ==
> MADlib grew out of discussions between database engine developers,
> data scientists, IT architects and academics interested in new
> approaches to scalable, sophisticated in-database analytics. These
> discussions were written up in a paper in VLDB 2009 that coined the
> term “MAD Skills” for data analysis
> (http://dl.acm.org/citation.cfm?id=1687576). The MADlib software
> project began the following year as a collaboration between
> researchers at UC Berkeley and engineers and data scientists at
> Pivotal (formerly EMC/Greenplum).
>
> The initial MADlib codebase came from EMC/Greenplum, UC Berkeley, the
> University of Wisconsin, and the University of Florida.  The project
> was publicly documented in a paper at VLDB 2012
> (http://vldb.org/pvldb/vol5/p1700_joehellerstein_vldb2012.pdf).  Today
> MADlib has contributors from around the world including both
> individuals and institutions.  For example, recent contributions have
> come from Pivotal, Stanford University, and the University of Illinois
> at Chicago.
>
> MADlib was conceived from the outset as a free, open source library
> for all to use and contribute to.  Since its inception, the community
> has steadily added new methods in the areas of mathematics,
> statistics, machine learning, and data transformation.  The current
> library includes over 30 principal algorithms as well as many
> additional operators and utility functions.
>
> The methods in MADlib are designed both for in- and out-of-core
> execution, and for the shared-nothing, scale-out parallelism offered
> by modern parallel database engines, ensuring that computation is done
> close to the data. The core functionality is written in declarative
> SQL statements, which orchestrate data movement to and from disk, and
> across networked machines. Single-node inner loops take advantage of
> SQL extensibility to call out to high performance math libraries in
> user-defined scalar and aggregate functions. At the highest level,
> tasks that require iteration and/or structure definition are coded in
> Python driver routines, which are used only to kick off the data-rich
> computations that happen within the database engine.
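The layering described in the paragraph above (a Python driver that only issues SQL, with the per-row work done by a user-defined aggregate inside the engine) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not MADlib's actual code or API: SQLite from the Python standard library stands in for PostgreSQL, Greenplum Database or HAWQ, the aggregate is written in Python rather than calling a high-performance math library, and the names `LinregrStep`, `linregr_step`, `fit_linregr` and the table `pts` are hypothetical.

```python
import sqlite3


class LinregrStep:
    """Hypothetical user-defined aggregate.

    Accumulates the sufficient statistics for a simple least-squares
    fit y = intercept + slope * x, one row at a time.
    """

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def step(self, x, y):
        # Called by the database engine once per scanned row.
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def finalize(self):
        # Solve the normal equations from the accumulated sums.
        denom = self.n * self.sxx - self.sx * self.sx
        if self.n < 2 or denom == 0:
            return None  # not enough information to fit a line
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        # SQLite aggregates return a single value, so pack both numbers.
        return f"{intercept!r},{slope!r}"


def fit_linregr(conn, table, x_col, y_col):
    """Hypothetical Python driver: it only kicks off SQL.

    The identifiers are interpolated directly, so this sketch assumes
    they come from trusted code, not from user input.
    """
    conn.create_aggregate("linregr_step", 2, LinregrStep)
    row = conn.execute(
        f"SELECT linregr_step({x_col}, {y_col}) FROM {table}"
    ).fetchone()
    if row[0] is None:
        return None
    intercept, slope = (float(v) for v in row[0].split(","))
    return intercept, slope


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pts (x REAL, y REAL)")
    conn.executemany(
        "INSERT INTO pts VALUES (?, ?)",
        [(1, 3), (2, 5), (3, 7), (4, 9)],
    )
    print(fit_linregr(conn, "pts", "x", "y"))
```

The point of the split is that the driver never pulls rows into Python: it sends one statement and reads back one small result, which is what lets the same pattern scale when the aggregate runs in parallel next to the data.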
>
> The first platforms supported by MADlib were Greenplum Database and
> PostgreSQL.  With the development of HAWQ SQL-on-Hadoop technology by
> Pivotal, MADlib offers a way to perform predictive analytics on very
> large data sets stored on a Hadoop cluster.
>
> Today, MADlib is in active development and is deployed on a wide
> variety of industry and academic projects across many different
> verticals.
>
> == Rationale ==
> Enterprises today are seeing the value of landing very large
> quantities of data in Hadoop clusters with the goal of improving their
> products and processes.  With the proliferation of increasingly
> sophisticated SQL-on-Hadoop technologies such as HAWQ, analysts can
> use the familiar SQL language to query this data at scale.  This
> effectively opens the door to Hadoop in the enterprise.
>
> Adding SQL-based predictive analytics like MADlib to the equation
> enables organizations to reason across large data sets without
> resorting to sampling, which has been a traditional approach when
> confronted with scale problems.  Operating on all of the data with
> MADlib results in more robust and accurate models.
>
> Since MADlib is a SQL-based interface, organizations do not need to
> re-train their teams on an unfamiliar programming language: SQL
> skills are ubiquitous in today's enterprises.
>
> Given the high velocity of innovation happening in the underlying
> Hadoop ecosystem, any SQL-based predictive analytics technology that
> plays in this ecosystem must be commensurately agile to keep up with
> the community. We strongly believe that in the Big Data space, this
> can be optimally achieved through a vibrant, diverse, self-governed
> community collectively innovating around a single codebase while at
> the same time cross-pollinating with various other data management
> communities. Apache Software Foundation is the ideal place to meet
> those ambitious goals.
>
> == Initial Goals ==
> Our initial goals are to bring MADlib into the ASF, transition the
> engineering and governance processes to be in accordance with the
> "Apache Way" and foster a collaborative development model closely
> aligned with that of HAWQ.
>
> Another important goal is encouraging efforts to port to other
> execution engines.
>
> The MADlib project will continue developing new functionality in an
> open, community-driven way. We envision accelerating innovation under
> ASF governance, in order to meet the requirements of a wide variety of
> predictive analytics use cases.
>
> We will also require transitioning of existing project infrastructure
> (source code, JIRA, mailing list) to the ASF infrastructure.
>
> == Current Status ==
> Currently, the project is available at http://madlib.net/. The
> codebase is licensed under a 2-clause BSD license. Our current
> governance model could be described as a "benevolent dictator" one. As
> stated above, the existing MADlib community feels that closer
> alignment with the HAWQ community, infrastructure and governance model,
> as proposed to the ASF, will allow the MADlib project to thrive far
> more than it would in relative isolation from HAWQ.
>
> === Meritocracy ===
> Our proposed list of initial committers includes the current MADlib R&D
> team at Pivotal and existing active members of the open source
> project. This group will form a base for the broader community we will
> invite to collaborate on the codebase. We intend to radically expand
> the initial developer and user community by running the project in
> accordance with the "Apache Way". Users and new contributors will be
> treated with respect and welcomed. By participating in the community
> and providing quality patches/support that move the project forward,
> they will earn merit. They also will be encouraged to provide non-code
> contributions (documentation, events, community management, etc.) and
> will gain merit for doing so. Those with a proven support and quality
> track record will be encouraged to become committers.
>
> === Community ===
> If MADlib is accepted for incubation, the primary initial goal will be
> transitioning the core community towards embracing the Apache Way of
> project governance. We would solicit major existing contributors to
> become committers on the project from the start.
>
> === Core Developers ===
> MADlib core developers are skilled in working as part of openly
> governed communities. That said, most of the core developers are
> currently NOT affiliated with the ASF and would require new ICLAs
> before committing to the project.
>
> === Alignment ===
> The following existing ASF projects can be considered when reviewing
> the MADlib proposal:
>
> The Apache Mahout project's goal is to build an environment for quickly
> creating scalable, performant machine learning applications. Apache
> Mahout is, perhaps, the oldest machine learning library in the Hadoop
> ecosystem. The three major components of Mahout are an environment for
> building scalable algorithms, many new Scala + Spark (H2O in progress)
> algorithms, and Mahout's mature Hadoop MapReduce algorithms. We see
> the two projects benefiting from each other's experience of
> implementing similar classes of algorithms and look forward to a
> fruitful exchange of ideas between the two communities.
>
> Apache Spark is a fast engine for processing large datasets, typically
> from a Hadoop cluster, and performing batch, streaming, interactive,
> or machine learning workloads.  Recently, Apache Spark has embraced
> SQL-like APIs around DataFrames at its core. Because of that we would
> expect a level of collaboration between the two projects. The Spark
> project also contains a library (MLlib) that is the closest cousin to
> MADlib. MLlib is Apache Spark's scalable machine learning library. We
> see the two projects benefiting from each other's experience of
> implementing similar classes of algorithms and look forward to a
> fruitful exchange of ideas between the two communities.
>
> Apache Hive is data warehouse software that facilitates querying and
> managing large datasets residing in distributed storage. Hive provides
> a mechanism to project structure onto this data and query the data
> using a SQL-like language called HiveQL. We see a potential for MADlib
> to leverage Hive as a backend the same way it currently leverages
> PostgreSQL-derived SQL backends. This could be especially useful for
> longer running algorithms.
>
> Apache Drill is a schema-free SQL query engine for Hadoop, NoSQL and
> Cloud Storage. We see a potential for MADlib to leverage Drill as a
> backend the same way it currently leverages PostgreSQL-derived SQL
> backends. This could be especially useful for analyzing data coming
> from heterogeneous sources and federated by the Drill engine.
>
> == Known Risks ==
> Development has been sponsored mostly by a single company (or its
> predecessors) thus far and coordinated mainly by the core Pivotal R&D
> team.
>
> So far, the project's governance model has explicitly been a
> "benevolent dictator" one. For the project to fully transition to the
> "Apache Way", development must shift towards the meritocracy-centric
> model of growing a community of contributors balanced with the needs
> for extreme stability and core implementation coherency.
>
> === Orphaned products ===
> The community proposing MADlib for incubation is an independent open
> source community. Even though Pivotal happens to be the biggest
> corporate sponsor of the project (by means of employing the core team)
> the community goes beyond those affiliated with Pivotal. On top of
> that, Pivotal is fully committed to maintaining its position as one of
> the leading providers of SQL-based analytics aimed squarely at data
> scientists. MADlib is the only game in town that can leverage SQL APIs
> ranging from traditional RDBMS technology all the way to data
> warehousing (Pivotal Greenplum Database) and into SQL-on-Hadoop
> (HAWQ). Moreover, Pivotal has a vested interest in making MADlib
> succeed by driving its close integration with sister ASF projects. We
> expect this to further reduce the risk of orphaning the product.
>
> Even in the absence of support by a particular vendor such as Pivotal,
> and in a worst-case scenario where HAWQ and Greenplum Database fail to
> gain traction in OSS, the existence of an established PostgreSQL OSS
> project means there will still be a working stack for MADlib.
>
> === Inexperience with Open Source ===
> MADlib has been an open source project from the outset. All developers
> working on the project (regardless of their employment affiliation)
> did so completely in the open. While the governance model of MADlib
> has been more of a benevolent dictator model, the project has always
> been receptive to accepting contributions from all sources and
> including them in future releases based on thorough code review,
> testing, and compliance with the project’s coding best practices.
>
> === Homogeneous Developers ===
> While most of the initial committers are employed by Pivotal, there's
> still a healthy level of interest coming from academia. On top of that
> we expect to spark curiosity in sister ASF projects and attract
> developers unaffiliated with Pivotal. Finally, MADlib is being used
> extensively whenever Pivotal engages with customers on data science
> projects. This typically means that the skills remain within a
> customer organization which further increases the chance of turning
> customer data scientists into MADlib contributors.
>
> === Reliance on Salaried Developers ===
> A large percentage of the contributors are paid to work in the Big
> Data space. While they might wander from their current employers, they
> are unlikely to venture far from their core expertise and thus will
> continue to be engaged with the project regardless of their current
> employers. In addition, the project is still enjoying popularity in
> academic circles and we hope that will help mitigate reliance on
> salaried developers as well.
>
> === Relationships with Other Apache Products ===
> As mentioned in the Alignment section, MADlib may consider various
> degrees of integration and code exchange with Apache Spark (MLlib),
> Apache Mahout, Apache Hive and Apache Drill projects. We expect
> integration points to be inside and outside the project. We look
> forward to collaborating with these communities as well as other
> communities under the Apache umbrella.
>
> === An Excessive Fascination with the Apache Brand ===
> While we intend to leverage the Apache "brand" when talking to other
> projects as a testament to our project’s neutrality, we have no plans
> for making use of the Apache brand in press releases nor posting
> billboards advertising acceptance of MADlib into Apache Incubator.
>
> == Documentation ==
> The documentation is currently available at:
> https://github.com/madlib/frontpage
>
> The documentation is currently licensed under 2-clause BSD license.
>
> == Initial Source ==
> Initial source code is available at:
>    * MADlib: https://github.com/madlib/madlib
>    * Testsuite: https://github.com/madlib/testsuite
>    * Contributors: https://github.com/madlib/contrib
>
> The code is currently licensed under 2-clause BSD license.
>
> == Source and Intellectual Property Submission Plan ==
> As soon as MADlib is approved to join the Incubator, the source code
> will be transitioned via the Software Grant Agreement onto ASF
> infrastructure and in turn made available under the Apache License,
> version 2.0.  We know of no legal encumbrances that would inhibit the
> transfer of source code to the ASF.
>
> == External Dependencies ==
>
> Runtime dependencies:
>    * boost-1.47.0 (Boost Software License)
>    * _m_widen_init (MIT for this subcomponent of GCC)
>    * python-argparse-1.2.1 (PSF LICENSE AGREEMENT FOR PYTHON 2.7.1)
>    * pyyaml-3.10 (MIT license)
>    * cern_root-5.34 (LGPL, however this dependency will be removed
> since the 2 cern modules used are being entirely re-written in MADlib)
>    * eigen-3.2.2 (Mozilla Public License)
>    * pyxb-1.2.4 (Apache license version 2)
>    * python (Python Software Foundation License Version 2)
>    * mathjax-2.5 (Apache license version 2)
>
> Build only dependencies:
>    * doxypy-0.4.2 (GPL)
>    * cmake-2.8.4 (BSD 3-clause License)
>    * doxygen >= 1.8.4 (GPL)
>    * flex >= 2.5.33 (BSD)
>    * bison >= 2.4 (GPL)
>    * latex (LaTeX Project Public License)
>    * TikZ-UML (no license information)
>
> Cryptography
>    * N/A
>
> == Required Resources ==
>
> === Mailing lists ===
>   * private@madlib.incubator.apache.org (moderated subscriptions)
>   * commits@madlib.incubator.apache.org
>   * dev@madlib.incubator.apache.org
>   * issues@madlib.incubator.apache.org
>   * user@madlib.incubator.apache.org
>
> === Git Repository ===
> https://git-wip-us.apache.org/repos/asf/incubator-madlib.git
>
> === Issue Tracking ===
> JIRA Project MADlib (MADLIB)
>
> We will also request migration of our current JIRA available at
> http://jira.madlib.net/
>
> === Other Resources ===
>
> Setting up regular builds for MADlib on builds.apache.org
> will require Docker support.
>
> == Initial Committers ==
>   * Anirudh Kondaveeti
>   * Caleb Welton
>   * Frank McQuillan
>   * Gang Xiong
>   * Gautam Muralidhar
>   * Hitoshi Harada
>   * Hulya Emir-farinas
>   * Ian Huston
>   * KeeSiong Ng
>   * Noel Sio
>   * Rahul Iyer
>   * Rashmi Raghu
>   * Regunathan Radhakrishnan
>   * Ronert Obst
>   * Samuel Ziegler
>   * Sarah Aerni
>   * Srivatsan Ramanujam
>   * Woo Jae Jung
>   * Xixuan Feng
>   * Yu Yang
>   * Atri Sharma
>   * Greg Chase
>   * Chloe Jackson
>   * Roman Shaposhnik
>   * Vaibhav Gumashta
>   * Ted Dunning
>   * Konstantin Boudnik
>
> == Affiliations ==
>   * Hortonworks: Vaibhav Gumashta
>   * MapR: Ted Dunning
>   * WANDisco: Konstantin Boudnik
>   * Barclays:  Atri Sharma
>   * Pivotal: everyone else on this proposal
>
> == Sponsors ==
>
> === Champion ===
> Roman Shaposhnik
>
> === Nominated Mentors ===
>
> The initial mentors are listed below:
>   * Ted Dunning - Apache Member, MapR
>   * Konstantin Boudnik - Apache Member, WANDisco
>   * Roman Shaposhnik - Apache Member, Pivotal
>
> === Sponsoring Entity ===
> We would like to propose that the Apache Incubator sponsor this project.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>
>
