incubator-general mailing list archives

From Makoto Yui <m...@treasure-data.com>
Subject Re: Call for Mentors
Date Wed, 31 Aug 2016 07:18:26 GMT
Jean-Baptiste,

Your experience as Podling mentor is very welcome.

Regards,
Makoto

2016-08-31 15:24 GMT+09:00 Jean-Baptiste Onofré <jb@nanthrax.net>:
> Hi Makoto,
>
> It would have been with a lot of pleasure, but I'm already a mentor in several
> podlings.
>
> Regards
> JB
>
>
> On 08/31/2016 06:30 AM, Makoto Yui wrote:
>>
>> As Roman mentioned, we welcome volunteering mentors.
>>
>> Please find our proposal in
>> https://wiki.apache.org/incubator/HivemallProposal
>>
>> Thanks,
>> Makoto
>>
>> 2016-08-31 11:28 GMT+09:00 Roman Shaposhnik <rvs@apache.org>:
>>>
>>> Hi!
>>>
>>> It seems that the discussion has converged and I'd like to
>>> make one extra call for volunteering mentors. Please let
>>> me know ASAP since I'd like to get the VOTE going tomorrow.
>>>
>>> Thanks,
>>> Roman.
>>>
>>> On Mon, Aug 22, 2016 at 10:20 AM, Roman Shaposhnik <rvs@apache.org>
>>> wrote:
>>>>
>>>> Hi!
>>>>
>>>> on behalf of the Hivemall team, I'd like to kick off
>>>> a discussion thread around accepting Hivemall
>>>> into the ASF Incubator.
>>>>
>>>> Hivemall is a library for machine learning implemented
>>>> as Hive UDFs/UDAFs/UDTFs that runs on Hadoop-based data
>>>> processing frameworks. More specifically, it currently
>>>> runs on Apache Hive, Apache Spark, and Apache Pig, which
>>>> support Hive UDFs as an extension mechanism.
>>>>
>>>> Here's the link to the proposal:
>>>>     https://wiki.apache.org/incubator/HivemallProposal
>>>> and the full text is also attached to this email.
>>>>
>>>> Two of the areas that I'd like to explicitly solicit IPMC's opinion
>>>> on are:
>>>>     1. whether the process of re-licensing from LGPL to ALv2
>>>>      was enough given the ASF's strict IP policies
>>>>
>>>>      2. whether the 5 initial committers make sense given that
>>>>      there's a total of 15 contributors as per GitHub stats.
>>>>
>>>> With that, thanks, in advance, for your time and let the discussion
>>>> begin!
>>>>
>>>> Thanks,
>>>> Roman.
>>>>
>>>> == Abstract ==
>>>>
>>>> Hivemall is a library for machine learning implemented as Hive
>>>> UDFs/UDAFs/UDTFs.
>>>>
>>>> Hivemall runs on Hadoop-based data processing frameworks, specifically
>>>> on Apache Hive, Apache Spark, and Apache Pig, that support Hive UDFs
>>>> as an extension mechanism.
>>>>
>>>> == Proposal ==
>>>>
>>>> Hivemall is a collection of machine learning algorithms and versatile
>>>> data analytics functions. It provides a number of easy-to-use machine
>>>> learning functionalities through user-defined functions (UDFs),
>>>> user-defined aggregate functions (UDAFs), and/or user-defined table
>>>> generating functions (UDTFs) of Apache Hive. It offers a variety of
>>>> functionalities: regression, classification, recommendation, anomaly
>>>> detection, k-nearest neighbor, and feature engineering. Hivemall
>>>> supports state-of-the-art machine learning algorithms such as Soft
>>>> Confidence Weighted, Adaptive Regularization of Weight Vectors,
>>>> Factorization Machines, and AdaDelta. Hivemall is mainly designed to
>>>> run on Apache Hive but it also supports Apache Pig and Apache Spark
>>>> for the runtime.
>>>>
>>>> == Background ==
>>>>
>>>> Hivemall started as a research project of the main developer at
>>>> National Institute of Advanced Industrial Science and Technology
>>>> (AIST) in 2013 and the initial version was released on 2 Oct, 2013 on
>>>> Github: https://github.com/myui/hivemall.
>>>>
>>>> After the main developer moved to Treasure Data in 2015, the project
>>>> has been actively developed as an open source product, and its license
>>>> was changed from GNU LGPL v2.1 to Apache License v2 on Mar 16, 2015. The
>>>> project copyright holders agreed to the license change at that time.
>>>>
>>>> The community is growing incrementally and the project has 15
>>>> contributors, 431 stars, and 131 forks on Github as of Aug 15, 2016.
>>>> The project received an InfoWorld Bossie Award (the best open
>>>> source big data tools) in 2014.
>>>>
>>>> Past major contributions by external contributors include Apache Pig
>>>> support from Daniel Dai (Hortonworks), and the Apache Spark port and an
>>>> integration with Apache YARN from Takeshi Yamamuro (NTT). Hivemall was
>>>> originally designed for Apache Hive but it now supports Apache Spark
>>>> and Apache Pig.
>>>>
>>>> == Rationale ==
>>>>
>>>> User-defined functions are a powerful mechanism to enrich the expressive
>>>> power of declarative query languages like SQL, HiveQL, PigLatin, and Spark
>>>> SQL. The Hive UDF interface is becoming the de facto standard for
>>>> SQL-on-Hadoop platforms; Apache Spark and Apache Pig have full
>>>> support for Hive UDFs/UDAFs/UDTFs, and Apache Impala, Apache Drill,
>>>> and Apache Tajo also have limited support for Hive UDFs/UDAFs.
>>>>
>>>> Hivemall can be considered a cross-platform library for machine
>>>> learning because it is implemented as cross-platform Hive
>>>> UDFs/UDAFs/UDTFs; prediction models built by a batch query of Apache
>>>> Hive can be used on Apache Spark/Pig, and conversely, prediction
>>>> models built by Apache Spark can be used from Apache Hive/Pig.
>>>>
>>>> Several database vendors are trying to offer machine learning
>>>> functionality in relational databases, so that the costs of moving
>>>> data can be eliminated. Apache MADlib, a machine learning library for
>>>> HAWQ and PostgreSQL, has been accepted as an Apache Incubator project.
>>>> MADlib is implemented using the PostgreSQL UDF interface.
>>>>
>>>> Apache Hive has a JIRA ticket, HIVE-7940, to support machine learning
>>>> functionalities, so we consider this proposal useful for the
>>>> community. We consider that Hivemall is better kept as a project
>>>> separate from Apache Hive because 1) we also target other data processing
>>>> frameworks such as Apache Spark as runtimes for Hivemall,
>>>> and 2) the current codebase is large enough to stand on its own.
>>>> Separation of concerns is good for project governance (e.g., release
>>>> management). For example, Apache Datafu is a data mining and statistics
>>>> library for Apache Pig and is a project separate from Apache Pig.
>>>>
>>>> We consider that Hivemall would be in a similar position to Apache Datafu,
>>>> but there are large differences in features and target runtimes.
>>>> The target runtime of Apache Datafu is Apache Pig, whereas Hivemall targets
>>>> Apache Hive, Apache Spark, and Apache Pig.
>>>> Apache Datafu is closer to a statistics library and does not
>>>> support machine learning features such as classification and
>>>> regression, whereas Hivemall is a machine learning library that supports them.
>>>>
>>>> == Initial Goals ==
>>>>
>>>> The initial goals are as follows:
>>>>  * Establish the project governance in the Apache way and broaden the
>>>> community
>>>>  * Improve documentation
>>>>  * Add more unit/scenario tests
>>>>  * Hand over code and copyrights
>>>>
>>>> == Current Status ==
>>>>
>>>> Hivemall has several on-going WIP features.
>>>>
>>>> Running the parameter server (a kind of distributed key-value store) as
>>>> an Apache YARN application is a major issue. Hivemall’s parameter server
>>>> is currently a standalone application. Running parameter servers on Apache
>>>> YARN makes it possible to use Hadoop cluster resources efficiently and makes
>>>> management of parameter servers easier.
>>>>
>>>> Another major WIP issue is integrating XGBoost into Hivemall. We need
>>>> more work and tests, e.g., supporting cross compilation of native JNI
>>>> objects of XGBoost.
>>>>
>>>> === Meritocracy ===
>>>>
>>>> The project members understand the importance of letting motivated
>>>> individuals contribute to the project. Since Hivemall was initially
>>>> released in 2013, it has received contributions from 14 contributors.
>>>>
>>>> Our intent with this incubator proposal is to build a diverse developer
>>>> community following the Apache meritocracy model. We welcome external
>>>> contributions and plan to elect committers from those who contribute
>>>> significantly to the project.
>>>>
>>>> === Community ===
>>>>
>>>> While there are 15 contributors in total, there are 3-4 active
>>>> developers continuously involved in major feature development at
>>>> the moment. We hope to extend our contributor base and encourage
>>>> suggestions and contributions from any potential user.
>>>>
>>>> === Core Developers ===
>>>>
>>>> The current main developers are employees of Treasure Data, NTT
>>>> and Hortonworks. Some of them are Hadoop/Pig PMC members and/or Hive
>>>> committers.
>>>>
>>>> === Alignment ===
>>>>
>>>> Incubating at ASF is the natural choice for the Hivemall project
>>>> because Hivemall targets Apache Hive, Apache Spark,
>>>> and Apache Pig as runtimes. We encourage integrations with other ASF data
>>>> processing frameworks like Apache Impala and Apache Drill.
>>>>
>>>> == Known Risks ==
>>>>
>>>> The contribution of the main developer is significant at the moment,
>>>> but this dependency should decrease as the community grows.
>>>>
>>>> === Orphaned products ===
>>>>
>>>> While the main developer is developing Hivemall as a full-time job at
>>>> TreasureData, the company is well aware of the open source
>>>> philosophy and the importance of open governance of open source
>>>> products. Orphaning an ASF product can itself be considered a risk.
>>>> Hence, we think the risks of it being orphaned are minimal.
>>>>
>>>> === Inexperience with Open Source ===
>>>>
>>>> Hivemall has been developed as an open source project since 2013.
>>>> The majority of the project members have jobs developing open source
>>>> products and some of them are working on other ASF projects like
>>>> Apache Hadoop and Apache Pig. We thus consider that the project
>>>> members have enough experience with open source development.
>>>>
>>>> === Homogenous Developers ===
>>>>
>>>> The current list of committers consists of developers from three
>>>> different companies. The committers are geographically distributed
>>>> across the U.S. and Asia. They are experienced with working in a
>>>> distributed environment.
>>>>
>>>> While not included in the initial committers, there are other external
>>>> contributors to the project. So, we hope to establish a developer
>>>> community that includes those contributors from several other
>>>> corporations during the incubation process.
>>>>
>>>> === Reliance on Salaried Developers ===
>>>>
>>>> The main developer is paid by his employer to contribute to this
>>>> project and the other developers are paid by their employers for
>>>> Hadoop-related open source development. While they might change their
>>>> affiliations over time, they are willing to lend their expertise to
>>>> open source development. So, the project would continue regardless
>>>> of their affiliations.
>>>>
>>>> === Relationships with Other Apache Products ===
>>>>
>>>> Hivemall is a collection of machine learning functions for Apache
>>>> Hive, Apache Spark, and Apache Pig. Apache MADlib is a collection of
>>>> machine learning functions for relational databases, i.e., Apache HAWQ
>>>> and PostgreSQL. There is no conflict in their target runtimes.
>>>>
>>>> === An Excessive Fascination with the Apache Brand ===
>>>>
>>>> Our interest in this incubation is attracting more contributors,
>>>> building a strong community with open governance, and increasing the
>>>> visibility of Hivemall in the market/community. We will be sensitive
>>>> to inadvertent abuse of the Apache brand for any commercial use and
>>>> will work with the Incubator PMC and project mentors to ensure the
>>>> brand policies are respected.
>>>>
>>>> == Documentation ==
>>>>
>>>> Information on Hivemall can be found at:
>>>> https://github.com/myui/hivemall/wiki
>>>>
>>>> == Initial Source ==
>>>>
>>>> We released the initial version of Hivemall in 2013 at
>>>> https://github.com/myui/hivemall and introduced Hivemall at the Hadoop
>>>> Summit 2014.
>>>>
>>>> == Source and Intellectual Property Submission Plan ==
>>>>
>>>> We know of no legal encumbrance to the transfer of the source to Apache. We
>>>> are going to obtain Contributor License Agreements (CLAs) covering all
>>>> intellectual property of Hivemall.
>>>>
>>>> Also, we plan to obtain a signed Software Grant Agreement (SGA)
>>>> from AIST.
>>>>
>>>> == External Dependencies ==
>>>>
>>>> Hivemall depends on the following third party libraries:
>>>>
>>>> Core module:
>>>>  * netty (The MIT License)
>>>>  * smile (Apache License v2.0)
>>>>  * org.tukaani.xz (Public Domain)
>>>>  * xgboost (Apache License v2.0)
>>>>  * hadoop (Apache License v2.0)
>>>>  * hive (Apache License v2.0)
>>>>  * log4j (Apache License v2.0)
>>>>  * guava (Apache License v2.0)
>>>>  * lucene-analyzers-kuromoji (Apache License v2.0)
>>>>  * junit (Eclipse Public License v1.0)
>>>>  * mockito (The MIT License)
>>>>  * powermock (Apache License v2.0)
>>>>  * kryo (BSD License)
>>>>
>>>> Hivemall on Spark:
>>>>  * spark (Apache License v2.0)
>>>>  * commons-cli (Apache License v2.0)
>>>>  * commons-logging (Apache License v2.0)
>>>>  * commons-compress (Apache License v2.0)
>>>>  * scala-library (BSD License)
>>>>  * scalatest (Apache License v2.0)
>>>>  * xerial-core (Apache License v2.0)
>>>>
>>>> The dependencies all have Apache compatible licenses.
>>>>
>>>> == Cryptography ==
>>>>
>>>> N/A
>>>>
>>>> == Required resources ==
>>>>
>>>> === Mailing lists ===
>>>>
>>>>  * private@hivemall.incubator.apache.org  (with moderated subscriptions)
>>>>  * commits@hivemall.incubator.apache.org
>>>>  * dev@hivemall.incubator.apache.org
>>>>  * user@hivemall.incubator.apache.org
>>>>
>>>> === Git Repository ===
>>>>
>>>> https://git-wip-us.apache.org/repos/asf/incubator-hivemall.git
>>>>
>>>> === JIRA assistance ===
>>>>
>>>> JIRA project Hivemall (HIVEMALL)
>>>>
>>>> == Initial Committers ==
>>>>
>>>>  * Makoto Yui (myui@treasure-data.com)
>>>>  * Takeshi Yamamuro (yamamuro.takshi@lab.ntt.co.jp)
>>>>  * Daniel Dai (daijy@hortonworks.com)
>>>>  * Tsuyoshi Ozawa (ozawa.tsuyoshi@lab.ntt.co.jp)
>>>>  * Kai Sasaki (sasaki@treasure-data.com)
>>>>
>>>> == Affiliations ==
>>>>
>>>> === Treasure Data ===
>>>>  * Makoto Yui
>>>>  * Kai Sasaki
>>>>
>>>> === NTT ===
>>>>  * Takeshi Yamamuro
>>>>  * Tsuyoshi Ozawa Apache Hadoop PMC member
>>>>
>>>> === Hortonworks ===
>>>>  * Daniel Dai (ASF member) Apache Pig PMC member
>>>>
>>>> == Sponsors ==
>>>>
>>>> === Champion ===
>>>>  * Roman Shaposhnik (Pivotal, ASF member, IPMC member) Apache
>>>> Bigtop/Incubator PMC member
>>>>
>>>> === Nominated Mentors ===
>>>>
>>>>  * Reynold Xin (Databricks, ASF member) Apache Spark PMC member
>>>>  * Markus Weimer (Microsoft, ASF member) Apache REEF PMC member
>>>>  * Xiangrui Meng (Databricks, ASF member) Apache Spark PMC member
>>>>
>>>> === Sponsoring Entity ===
>>>>
>>>> We are requesting the Incubator to sponsor this project.
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>>> For additional commands, e-mail: general-help@incubator.apache.org
>>>
>>
>>
>>
>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
>
>



-- 
Makoto YUI <myui AT treasure-data.com>
Research Engineer, Treasure Data, Inc.
http://myui.github.io/


