incubator-cvs mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Incubator Wiki] Update of "MADlibProposal" by RomanShaposhnik
Date Tue, 01 Sep 2015 16:27:24 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "MADlibProposal" page has been changed by RomanShaposhnik:
https://wiki.apache.org/incubator/MADlibProposal?action=diff&rev1=1&rev2=2

  == Abstract ==
  MADlib is an open-source library (licensed under 2-clause BSD license) for scalable in-database
analytics. It provides data-parallel implementations of mathematical, statistical and machine
learning methods for structured and unstructured data. The MADlib mission is to foster widespread
development of scalable analytic skills, by harnessing efforts from commercial practice, academic
research, and open source development.
  
- MADlib occupies a unique niche in the realm of data science and machine learning libraries
since its SQL APIs allow it to work on a wide range of data stores and SQL engines.
+ MADlib occupies a unique niche in the realm of data science and machine learning libraries
since its SQL APIs can allow it to work on a wide range of data stores and SQL engines.
  
  == Proposal ==
  The current open source community behind MADlib feels that aligning itself with HAWQ's community,
governance model, infrastructure and roadmap will allow the project to accelerate adoption
and community growth. Given HAWQ's trajectory of entering Apache Software Foundation family
as an Incubating project, we feel that the best course of action for MADlib is to follow a
similar route.
  
- MADlib and HAWQ are complementary technologies in that MADlib in-database analytical functions
can run within the HAWQ execution engine. (MADlib also runs on Greenplum Database and PostreSQL
today.) It is expected that contributors to MADlib will be cognizant of the HAWQ ASF project
and may contribute to it as well.  Contributors may also look at the HAWQ project as a starting
port for ports to other parallel database engines. In short, collaboration between the two
communities will make both projects more vibrant and advance the respective technologies in
potentially novel directions.
+ MADlib and HAWQ are complementary technologies in that MADlib in-database analytical functions
can run within the HAWQ execution engine. (MADlib also runs on Greenplum Database and PostgreSQL
today.) It is expected that contributors to MADlib will be cognizant of the HAWQ ASF project
and may contribute to it as well.  In short, collaboration between the two communities will
make both projects more vibrant and advance the respective technologies in potentially novel
directions.
  
+ Contributors may also look at the HAWQ project as a starting point for ports to other parallel
database engines. This proposal highly encourages this type of work as it would help to further
realize the original cross-platform goal of MADlib as envisioned by its originators.
+ 
- Thus, the goal of this proposal is to bring the existing MADlib open source community into
ASF, change the project's governance model to be the "Apache Way" and transition project's
codebase and infrastructure into ASF INFRA. The community has agreed to transfer the brand
name "MADlib" to Apache Software Foundation as well.
+ Thus, the goal of this proposal is to bring the existing MADlib open source community into
ASF, change the project's governance model to the "Apache Way" and transition the project's
codebase and infrastructure into ASF INFRA. The community has agreed to transfer the brand
name "MADlib" to Apache Software Foundation as well.
  
  Pivotal Inc. on behalf of the MADlib open source community is submitting this proposal to
transition source code and associated artifacts (documentation, web site content, wiki, etc.)
to the Apache Software Foundation Incubator under the Apache License, Version 2.0 and is asking
the Incubator PMC to establish a MADlib incubating project.
  
+ Currently MADlib uses a few category X licensed software tools during its build (mostly
for generating documentation):
+    * doxypy 0.4.2 (GPL)
+    * doxygen 1.8.4 (GPL) 
+    * TikZ-UML
+    * bison 2.4 (GPL, with an exception for generated output)
+ We feel that this usage is compatible with an overall project licensed under the ALv2 and
don't anticipate any changes.
+ Our usage of LGPL library cern_root-5.34 is expected to go away since the 2 cern modules
used are being entirely re-written 
+ in MADlib.
+ 
+ Finally, MADlib's inclusion of an MPL-licensed library (eigen 3.2.2) in its binary artifact
seems to be consistent with
+ the ASF recommendation for managing "weak copyleft" dependencies.
+   
+ 
  == Background ==
- MADlib grew out of discussions between database-engine developers, data scientists, IT architects
and academics interested in new approaches to scalable, sophisticated in-database analytics.
These discussions were written up in a paper in VLDB 2009 that coined the term “MAD Skills”
for data analysis. The MADlib software project began the following year as a collaboration
between researchers at UC Berkeley and engineers and data scientists at Pivotal (former EMC/Greenplum).
 
+ MADlib grew out of discussions between database engine developers, data scientists, IT architects
and academics interested in new approaches to scalable, sophisticated in-database analytics.
These discussions were written up in a paper in VLDB 2009 that coined the term “MAD Skills”
for data analysis (http://dl.acm.org/citation.cfm?id=1687576). The MADlib software project
began the following year as a collaboration between researchers at UC Berkeley and engineers
and data scientists at Pivotal (former EMC/Greenplum).  
  
- The initial MADlib codebase came from EMC/Greenplum, UC Berkeley, the University of Wisconsin,
and the University of Florida.  Today MADlib has contributors from around the world including
both individuals and institutions.  For example, recent contributions have come from Pivotal,
Stanford University, and the University of Illinois at Chicago.  
+ The initial MADlib codebase came from EMC/Greenplum, UC Berkeley, the University of Wisconsin,
and the University of Florida.  The project was publicly documented in a paper at VLDB 2012
(http://vldb.org/pvldb/vol5/p1700_joehellerstein_vldb2012.pdf).  Today MADlib has contributors
from around the world including both individuals and institutions.  For example, recent contributions
have come from Pivotal, Stanford University, and the University of Illinois at Chicago.  
  
- MADlib was conceived from the outset as a free, open source library for all to use and contribute
to.  Since 2009, the community has steadily added new methods in the areas of mathematics,
statistics, machine learning, and data transformation.  The current library includes over
30 principle algorithms as well as many additional operators and utility functions.
+ MADlib was conceived from the outset as a free, open source library for all to use and contribute
to.  Since its inception, the community has steadily added new methods in the areas of mathematics,
statistics, machine learning, and data transformation.  The current library includes over
30 principal algorithms as well as many additional operators and utility functions.
  
  The methods in MADlib are designed both for in-core and out-of-core execution, and for the shared-nothing,
scale-out parallelism offered by modern parallel database engines, ensuring that computation
is done close to the data. The core functionality is written in declarative SQL statements,
which orchestrate data movement to and from disk, and across networked machines. Single-node
inner loops take advantage of SQL extensibility to call out to high performance math libraries
in user-defined scalar and aggregate functions. At the highest level, tasks that require iteration
and/or structure definition are coded in Python driver routines, which are used only to kick
off the data-rich computations that happen within the database engine.
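
As a rough illustration of the layering described above (not MADlib's actual API), the sketch below uses Python's built-in sqlite3 module as a stand-in for a parallel database engine. The aggregate name grad_sum, the table points, and the functions fit_slope and demo are hypothetical names chosen for this example. The per-row arithmetic lives in a user-defined aggregate, one SQL statement per iteration scans the data, and the Python driver only decides when to iterate again.

```python
import sqlite3


class GradSum:
    """User-defined aggregate: mean squared-error gradient for the model y = w * x."""

    def __init__(self):
        self.grad = 0.0
        self.n = 0

    def step(self, x, y, w):
        # Called once per row inside the database engine.
        self.grad += 2.0 * (w * x - y) * x
        self.n += 1

    def finalize(self):
        # Called once after all rows have been seen.
        return self.grad / self.n if self.n else 0.0


def fit_slope(conn, iterations=200, learning_rate=0.05):
    """Python driver: holds only the scalar model state and the iteration loop."""
    w = 0.0
    for _ in range(iterations):
        # The data-heavy pass happens in SQL; only one number comes back.
        (grad,) = conn.execute(
            "SELECT grad_sum(x, y, ?) FROM points", (w,)
        ).fetchone()
        w -= learning_rate * grad
    return w


def demo():
    conn = sqlite3.connect(":memory:")
    # Register the aggregate under a hypothetical name with 3 arguments.
    conn.create_aggregate("grad_sum", 3, GradSum)
    conn.execute("CREATE TABLE points (x REAL, y REAL)")
    conn.executemany(
        "INSERT INTO points VALUES (?, ?)",
        [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)],
    )
    return fit_slope(conn)


if __name__ == "__main__":
    print(demo())
```

In this toy setup the rows never leave the database connection; the driver sees a single gradient value per iteration. In a shared-nothing engine the same pattern would presumably let each segment aggregate its own rows before the partial results are combined, which SQLite does not model.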
  
@@ -37, +52 @@

  Given the high velocity of innovation happening in the underlying Hadoop ecosystem, any
SQL-based predictive analytics technology that plays in this ecosystem must be commensurately
agile to keep up with the community. We strongly believe that in the Big Data space, this
can be optimally achieved through a vibrant, diverse, self-governed community collectively
innovating around a single codebase while at the same time cross-pollinating with various
other data management communities. Apache Software Foundation is the ideal place to meet those
ambitious goals.
  
  == Initial Goals ==
- Our initial goals are to bring MADlib into the ASF, transition the engineering and governance
processes to be in accordance with the "Apache Way" and foster a collaborative development
model closely aligned with that of HAWQ.
+ Our initial goals are to bring MADlib into the ASF, transition the engineering and governance
processes to be in accordance with the "Apache Way" and foster a collaborative development
model closely aligned with that of HAWQ.  
+ 
+ Another important goal is encouraging efforts to port to other execution engines.
  
  The MADlib project will continue developing new functionality in an open, community-driven
way. We envision accelerating innovation under ASF governance, in order to meet the requirements
of a wide variety of predictive analytics use cases.
  
@@ -73, +90 @@

  
  === Orphaned products ===
  The community proposing MADlib for incubation is an independent open source community. Even
though Pivotal happens to be the biggest corporate sponsor of the project (by means of employing
the core team) the community goes beyond those affiliated with Pivotal. On top of that, Pivotal
is fully committed to maintaining its position as one of the leading providers of SQL-based analytics
aimed squarely at data scientists. MADlib is the only game in town that can leverage SQL APIs
ranging from traditional RDBMS technology all the way to data warehousing (Pivotal Greenplum
Database) and into SQL-on-Hadoop (HAWQ). Moreover, Pivotal has a vested interest in making
MADlib succeed by driving its close integration with sister ASF projects. We expect this to
further reduce the risk of orphaning the product.
+ 
+ Even in the absence of support by a particular vendor such as Pivotal, and in a worst-case
scenario where HAWQ and Greenplum Database fail to gain traction in OSS, the existence of
an established PostgreSQL OSS project means there will still be a working stack for MADlib.
  
  === Inexperience with Open Source ===
  MADlib has been an open source project from the outset. All developers working on the project
(regardless of their employment affiliation) did so completely in the open. While the governance
model of MADlib has been more of a benevolent dictator model, the project has always been
receptive to accepting contributions from all sources and including them in future releases
based on thorough code review, testing, and compliance with the project’s coding best practices.
@@ -98, +117 @@

  Initial source code is available at: 
     * MADlib: https://github.com/madlib/madlib
     * Testsuite: https://github.com/madlib/testsuite
-    * Contribs: https://github.com/madlib/contrib
+    * Contributors: https://github.com/madlib/contrib
  
  The code is currently licensed under 2-clause BSD license.
  
@@ -109, +128 @@

  
  Runtime dependencies:
     * boost-1.47.0 (Boost Software License)
-    * _m_widen_init (license TBD)
+    * _m_widen_init (MIT for this subcomponent of GCC)
     * python-argparse-1.2.1 (PSF LICENSE AGREEMENT FOR PYTHON 2.7.1)
     * pyyaml-3.10 (MIT license)
-    * cern_root-5.34 (LGPL)
+    * cern_root-5.34 (LGPL, however this dependency will be removed since the 2 cern modules
used are being entirely re-written in MADlib)
     * eigen-3.2.2 (Mozilla Public License)
     * pyxb-1.2.4 (Apache license version 2)
     * python (Python Software Foundation License Version 2)
+    * mathjax-2.5 (Apache license version 2)
  
  Build only dependencies:
     * doxypy-0.4.2 (GPL) 


