incubator-general mailing list archives

From Makoto Yui <m...@treasure-data.com>
Subject Call for Mentors
Date Wed, 31 Aug 2016 04:30:53 GMT
As Roman mentioned, we welcome volunteering mentors.

Please find our proposal in
https://wiki.apache.org/incubator/HivemallProposal

Thanks,
Makoto

2016-08-31 11:28 GMT+09:00 Roman Shaposhnik <rvs@apache.org>:
> Hi!
>
> It seems that the discussion has converged and I'd like to
> make one extra call for volunteering mentors. Please let
> me know ASAP since I'd like to get the VOTE going tomorrow.
>
> Thanks,
> Roman.
>
> On Mon, Aug 22, 2016 at 10:20 AM, Roman Shaposhnik <rvs@apache.org> wrote:
>> Hi!
>>
>> on behalf of the Hivemall team, I'd like to kick off
>> a discussion thread around accepting Hivemall
>> into the ASF Incubator.
>>
>> Hivemall is a library for machine learning implemented
>> as Hive UDFs/UDAFs/UDTFs that runs on Hadoop-based data
>> processing frameworks. More specifically, it currently
>> runs on Apache Hive, Apache Spark, and Apache Pig, which
>> support Hive UDFs as an extension mechanism.
>>
>> Here's the link to the proposal:
>>     https://wiki.apache.org/incubator/HivemallProposal
>> and the full text is also attached to this email.
>>
>> Two of the areas that I'd like to explicitly solicit IPMC's opinion
>> on are:
>>     1. whether the process of re-licensing from LGPL to ALv2
>>      was enough given the ASF's strict IP policies
>>
>>      2. whether the 5 initial committers make sense given that
>>      there's a total of 15 contributors as per GitHub stats.
>>
>> With that, thanks, in advance, for your time and let the discussion begin!
>>
>> Thanks,
>> Roman.
>>
>> == Abstract ==
>>
>> Hivemall is a library for machine learning implemented as Hive UDFs/UDAFs/UDTFs.
>>
>> Hivemall runs on Hadoop-based data processing frameworks, specifically
>> on Apache Hive, Apache Spark, and Apache Pig, that support Hive UDFs
>> as an extension mechanism.
>>
>> == Proposal ==
>>
>> Hivemall is a collection of machine learning algorithms and versatile
>> data analytics functions. It provides a number of easy-to-use machine
>> learning functionalities through user-defined functions (UDFs),
>> user-defined aggregate functions (UDAFs), and/or user-defined table
>> generating functions (UDTFs) of Apache Hive. It offers a variety of
>> functionalities: regression, classification, recommendation, anomaly
>> detection, k-nearest neighbor, and feature engineering. Hivemall
>> supports state-of-the-art machine learning algorithms such as Soft
>> Confidence Weighted, Adaptive Regularization of Weight Vectors,
>> Factorization Machines, and AdaDelta. Hivemall is mainly designed to
>> run on Apache Hive but it also supports Apache Pig and Apache Spark
>> for the runtime.
>>
>> == Background ==
>>
>> Hivemall started as a research project of the main developer at
>> National Institute of Advanced Industrial Science and Technology
>> (AIST) in 2013 and the initial version was released on 2 Oct, 2013 on
>> Github: https://github.com/myui/hivemall.
>>
>> After the main developer moved to Treasure Data in 2015, the project
>> has been actively developed as an open source product, and its
>> license was changed from GNU LGPL v2.1 to Apache License v2 on Mar 16,
>> 2015. The project copyright holders agreed to the license change at
>> that time.
>>
>> The community is growing incrementally and the project has 15
>> contributors, 431 stars, and 131 forks on Github as of Aug 15, 2016.
>> The project received an InfoWorld Bossie Award (the best open
>> source big data tools) in 2014.
>>
>> Past major contributions by external contributors include Apache Pig
>> support from Daniel Dai (Hortonworks), and the Apache Spark port and
>> an integration with Apache YARN from Takeshi Yamamuro (NTT). Hivemall was
>> originally designed for Apache Hive but it now supports Apache Spark
>> and Apache Pig.
>>
>> == Rationale ==
>>
>> User-defined functions are a powerful mechanism to enrich the
>> expressive power of declarative query languages like SQL, HiveQL,
>> PigLatin, and Spark SQL. The Hive UDF interface is becoming the de
>> facto standard for SQL-on-Hadoop platforms; Apache Spark and Apache
>> Pig have full support for Hive UDFs/UDAFs/UDTFs, and Apache Impala,
>> Apache Drill, and Apache Tajo also have limited support for Hive
>> UDFs/UDAFs.
>>
>> Hivemall can be considered a cross-platform library for machine
>> learning because it is implemented as cross-platform Hive
>> UDFs/UDAFs/UDTFs; prediction models built by a batch query of Apache
>> Hive can be used on Apache Spark/Pig, and conversely, prediction
>> models built by Apache Spark can be used from Apache Hive/Pig.
>>
>> Several database vendors are trying to offer machine learning
>> functionality in relational databases, so that the costs of moving
>> data can be eliminated. Apache MADlib, a machine learning library for
>> HAWQ and PostgreSQL, has been accepted as an Apache Incubator project.
>> MADlib is implemented using the PostgreSQL UDF interface.
>>
>> Apache Hive has a JIRA ticket, HIVE-7940, to support machine learning
>> functionalities, so we consider this proposal useful for the
>> community. We believe Hivemall is better as a project separate from
>> Apache Hive because 1) we also target other data processing
>> frameworks such as Apache Spark as runtimes for Hivemall,
>> and 2) the current codebase is large enough to stand on its own.
>> Separation of concerns is good for project governance (e.g., release
>> management). For example, Apache Datafu is a data mining and
>> statistics library for Apache Pig and is a project separate from
>> Apache Pig.
>>
>> We consider that Hivemall would occupy a position similar to Apache
>> Datafu, but there are large differences in features and target
>> runtimes. The target runtime of Apache Datafu is Apache Pig, whereas
>> Hivemall targets Apache Hive, Apache Spark, and Apache Pig.
>> Apache Datafu is more of a statistics library and does not
>> support machine learning features such as classification and
>> regression, whereas Hivemall is a machine learning library that
>> supports them.
>>
>> == Initial Goals ==
>>
>> The initial goals are as follows:
>>  * Establish the project governance in the Apache way and broaden the community
>>  * Improve documentation
>>  * Add more unit/scenario tests
>>  * Hand over code and copyrights
>>
>> == Current Status ==
>>
>> Hivemall has several work-in-progress (WIP) features.
>>
>> Running the parameter server (a kind of distributed key-value store) as
>> an Apache YARN application is a major issue. Hivemall’s parameter server
>> is currently a standalone application. Running parameter servers on
>> Apache YARN would use Hadoop cluster resources efficiently and make
>> management of parameter servers easier.
>>
>> Another major WIP issue is integrating XGBoost into Hivemall. This
>> needs more work and testing, e.g., supporting cross compilation of the
>> native JNI objects of XGBoost.
>>
>> === Meritocracy ===
>>
>> The project members understand the importance of letting motivated
>> individuals contribute to the project. Since Hivemall was initially
>> released in 2013, it has received contributions from 14 contributors.
>>
>> Our intent of this incubator proposal is building a diverse developer
>> community following the Apache meritocracy model. We welcome external
>> contributions and plan to elect committers from those who contribute
>> significantly to the project.
>>
>> === Community ===
>>
>> While there are 15 contributors in total, there are 3-4 active
>> developers continuously involved in major feature development at
>> the moment. We hope to extend our contributor base and encourage
>> suggestions and contributions from any potential user.
>>
>> === Core Developers ===
>>
>> The current main developers are employees of Treasure Data, NTT,
>> and Hortonworks. Some of them are Hadoop/Pig PMC members and/or Hive
>> committers.
>>
>> === Alignment ===
>>
>> Incubating at the ASF is the natural choice for the Hivemall project
>> because Hivemall targets Apache Hive, Apache Spark, and Apache Pig
>> as runtimes. We encourage integrations with other ASF data
>> processing frameworks like Apache Impala and Apache Drill.
>>
>> == Known Risks ==
>>
>> The contributions of the main developer are significant at the moment,
>> but this dependency should decrease as the community grows.
>>
>> === Orphaned products ===
>>
>> While the main developer is developing Hivemall as a full-time job at
>> Treasure Data, the company is well aware of the open source
>> philosophy and the importance of open governance of open source
>> products. Orphaning an ASF product would itself be considered a risk.
>> Hence, we think the risks of it being orphaned are minimal.
>>
>> === Inexperience with Open Source ===
>>
>> Hivemall has been developed as an open source project since 2013.
>> The majority of the project members have jobs developing open source
>> products and some of them are working on other ASF projects like
>> Apache Hadoop and Apache Pig. We thus consider that the project
>> members have enough experience with open source development.
>>
>> === Homogenous Developers ===
>>
>> The current list of committers consists of developers from three
>> different companies. The committers are geographically distributed
>> across the U.S. and Asia. They are experienced with working in a
>> distributed environment.
>>
>> While not included in the initial committers, there are other external
>> contributors to the project. So, we hope to establish a developer
>> community that includes those contributors from several other
>> corporations during the incubation process.
>>
>> === Reliance on Salaried Developers ===
>>
>> The major developer is paid by his employer to contribute to this
>> project and the other developers are paid by their employers for
>> Hadoop-related open source development. While they might change their
>> affiliations over time, they are willing to contribute their expertise
>> to the open source development. So, the project would continue
>> regardless of their affiliations.
>>
>> === Relationships with Other Apache Products ===
>>
>> Hivemall is a collection of machine learning functions on Apache
>> Hive, Apache Spark, and Apache Pig. Apache MADlib is a collection of
>> machine learning functions for relational databases, i.e., Apache HAWQ
>> and PostgreSQL. There is no conflict in their target runtimes.
>>
>> === An Excessive Fascination with the Apache Brand ===
>>
>> Our interest in this incubation is attracting more contributors,
>> building a strong community with open governance, and increasing the
>> visibility of Hivemall in the market/community. We will be sensitive
>> to inadvertent abuse of the Apache brand for any commercial use and
>> will work with the Incubator PMC and project mentors to ensure the
>> brand policies are respected.
>>
>> == Documentation ==
>>
>> Information on Hivemall can be found at:
>> https://github.com/myui/hivemall/wiki
>>
>> == Initial Source ==
>>
>> We released the initial version of Hivemall in 2013 at
>> https://github.com/myui/hivemall and introduced Hivemall at the Hadoop
>> Summit 2014.
>>
>> == Source and Intellectual Property Submission Plan ==
>>
>> We know of no legal encumbrance to the transfer of the source to
>> Apache. We are going to obtain Contributor License Agreements (CLAs)
>> for all property of Hivemall.
>>
>> Also, we plan to obtain a signed Software Grant Agreement (SGA) from AIST.
>>
>> == External Dependencies ==
>>
>> Hivemall depends on the following third party libraries:
>>
>> Core module:
>>  * netty (The MIT License)
>>  * smile (Apache License v2.0)
>>  * org.tukaani.xz (Public Domain)
>>  * xgboost (Apache License v2.0)
>>  * hadoop (Apache License v2.0)
>>  * hive (Apache License v2.0)
>>  * log4j (Apache License v2.0)
>>  * guava (Apache License v2.0)
>>  * lucene-analyzers-kuromoji (Apache License v2.0)
>>  * junit (Eclipse Public License v1.0)
>>  * mockito (The MIT License)
>>  * powermock (Apache License v2.0)
>>  * kryo (BSD License)
>>
>> Hivemall on Spark:
>>  * spark (Apache License v2.0)
>>  * commons-cli  (Apache License v2.0)
>>  * commons-logging (Apache License v2.0)
>>  * commons-compress (Apache License v2.0)
>>  * scala-library (BSD License)
>>  * scalatest (Apache License v2.0)
>>  * xerial-core (Apache License v2.0)
>>
>> The dependencies all have Apache compatible licenses.
>>
>> == Cryptography ==
>>
>> N/A
>>
>> == Required resources ==
>>
>> === Mailing lists ===
>>
>>  * private@hivemall.incubator.apache.org  (with moderated subscriptions)
>>  * commits@hivemall.incubator.apache.org
>>  * dev@hivemall.incubator.apache.org
>>  * user@hivemall.incubator.apache.org
>>
>> === Git Repository ===
>>
>> https://git-wip-us.apache.org/repos/asf/incubator-hivemall.git
>>
>> === JIRA assistance ===
>>
>> JIRA project Hivemall (HIVEMALL)
>>
>> == Initial Committers ==
>>
>>  * Makoto Yui (myui@treasure-data.com)
>>  * Takeshi Yamamuro (yamamuro.takshi@lab.ntt.co.jp)
>>  * Daniel Dai (daijy@hortonworks.com)
>>  * Tsuyoshi Ozawa (ozawa.tsuyoshi@lab.ntt.co.jp)
>>  * Kai Sasaki (sasaki@treasure-data.com)
>>
>> == Affiliations ==
>>
>> === Treasure Data ===
>>  * Makoto Yui
>>  * Kai Sasaki
>>
>> === NTT ===
>>  * Takeshi Yamamuro
>>  * Tsuyoshi Ozawa Apache Hadoop PMC member
>>
>> === Hortonworks ===
>>  * Daniel Dai (ASF member) Apache Pig PMC member
>>
>> == Sponsors ==
>>
>> === Champion ===
>>  * Roman Shaposhnik (Pivotal, ASF member, IPMC member) Apache
>> Bigtop/Incubator PMC member
>>
>> === Nominated Mentors ===
>>
>>  * Reynold Xin (Databricks, ASF member) Apache Spark PMC member
>>  * Markus Weimer (Microsoft, ASF member) Apache REEF PMC member
>>  * Xiangrui Meng (Databricks, ASF member) Apache Spark PMC member
>>
>> === Sponsoring Entity ===
>>
>> We are requesting the Incubator to sponsor this project.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>



-- 
Makoto YUI <myui AT treasure-data.com>
Research Engineer, Treasure Data, Inc.
http://myui.github.io/


