incubator-general mailing list archives

From Makoto Yui <m...@treasure-data.com>
Subject Re: Call for Mentors
Date Wed, 31 Aug 2016 09:49:13 GMT
JB,

Sure. Looking forward to seeing potential mentors.

We currently have 3 mentors, but additional mentors are welcome
since some of them are busy taking vacation :-)

Thanks,
Makoto

2016-08-31 17:37 GMT+09:00 Jean-Baptiste Onofré <jb@nanthrax.net>:
> I'm very busy mentoring my current podling bucket. I'm sure other potential
> mentors will contact you !
>
> Regards
> JB
>
>
> On 08/31/2016 09:18 AM, Makoto Yui wrote:
>>
>> Jean-Baptiste,
>>
>> Your experience as a podling mentor is very welcome.
>>
>> Regards,
>> Makoto
>>
>> 2016-08-31 15:24 GMT+09:00 Jean-Baptiste Onofré <jb@nanthrax.net>:
>>>
>>> Hi Makoto,
>>>
>>> It would have been a pleasure, but I'm already a mentor in several
>>> podlings.
>>>
>>> Regards
>>> JB
>>>
>>>
>>> On 08/31/2016 06:30 AM, Makoto Yui wrote:
>>>>
>>>>
>>>> As Roman mentioned, we welcome volunteering mentors.
>>>>
>>>> Please find our proposal in
>>>> https://wiki.apache.org/incubator/HivemallProposal
>>>>
>>>> Thanks,
>>>> Makoto
>>>>
>>>> 2016-08-31 11:28 GMT+09:00 Roman Shaposhnik <rvs@apache.org>:
>>>>>
>>>>>
>>>>> Hi!
>>>>>
>>>>> It seems that the discussion has converged and I'd like to
>>>>> make one extra call for volunteering mentors. Please let
>>>>> me know ASAP since I'd like to get the VOTE going tomorrow.
>>>>>
>>>>> Thanks,
>>>>> Roman.
>>>>>
>>>>> On Mon, Aug 22, 2016 at 10:20 AM, Roman Shaposhnik <rvs@apache.org>
>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> On behalf of the Hivemall team, I'd like to kick off
>>>>>> a discussion thread around accepting Hivemall
>>>>>> into the ASF Incubator.
>>>>>>
>>>>>> Hivemall is a library for machine learning implemented
>>>>>> as Hive UDFs/UDAFs/UDTFs that runs on Hadoop-based
>>>>>> data processing frameworks. More specifically, it currently
>>>>>> runs on Apache Hive, Apache Spark, and Apache Pig, which
>>>>>> support Hive UDFs as an extension mechanism.
>>>>>>
>>>>>> Here's the link to the proposal:
>>>>>>     https://wiki.apache.org/incubator/HivemallProposal
>>>>>> and the full text is also attached to this email.
>>>>>>
>>>>>> Two of the areas that I'd like to explicitly solicit IPMC's opinion
>>>>>> on are:
>>>>>>     1. whether the process of re-licensing from LGPL to ALv2
>>>>>>      was enough given the ASF's strict IP policies
>>>>>>
>>>>>>      2. whether the 5 initial committers make sense given that
>>>>>>      there's a total of 15 contributors as per GitHub stats.
>>>>>>
>>>>>> With that, thanks, in advance, for your time and let the discussion
>>>>>> begin!
>>>>>>
>>>>>> Thanks,
>>>>>> Roman.
>>>>>>
>>>>>> == Abstract ==
>>>>>>
>>>>>> Hivemall is a library for machine learning implemented as Hive
>>>>>> UDFs/UDAFs/UDTFs.
>>>>>>
>>>>>> Hivemall runs on Hadoop-based data processing frameworks, specifically
>>>>>> on Apache Hive, Apache Spark, and Apache Pig, that support Hive UDFs
>>>>>> as an extension mechanism.
>>>>>>
>>>>>> == Proposal ==
>>>>>>
>>>>>> Hivemall is a collection of machine learning algorithms and versatile
>>>>>> data analytics functions. It provides a number of easy-to-use machine
>>>>>> learning functions through user-defined functions (UDFs),
>>>>>> user-defined aggregate functions (UDAFs), and/or user-defined
>>>>>> table-generating functions (UDTFs) of Apache Hive. It offers a variety
>>>>>> of functionality: regression, classification, recommendation, anomaly
>>>>>> detection, k-nearest neighbor, and feature engineering. Hivemall
>>>>>> supports state-of-the-art machine learning algorithms such as Soft
>>>>>> Confidence Weighted, Adaptive Regularization of Weight Vectors,
>>>>>> Factorization Machines, and AdaDelta. Hivemall is mainly designed to
>>>>>> run on Apache Hive, but it also supports Apache Pig and Apache Spark
>>>>>> as runtimes.
>>>>>>
>>>>>> == Background ==
>>>>>>
>>>>>> Hivemall started as a research project of the main developer at the
>>>>>> National Institute of Advanced Industrial Science and Technology
>>>>>> (AIST) in 2013, and the initial version was released on 2 Oct, 2013
>>>>>> on GitHub: https://github.com/myui/hivemall.
>>>>>>
>>>>>> Since the main developer moved to Treasure Data in 2015, the project
>>>>>> has been actively developed as an open source product, and it changed
>>>>>> its license from GNU LGPL v2.1 to Apache License v2 on Mar 16, 2015.
>>>>>> The project copyright holders agreed to the license change at that
>>>>>> time.
>>>>>>
>>>>>> The community is growing incrementally and the project has 15
>>>>>> contributors, 431 stars, and 131 forks on GitHub as of Aug 15, 2016.
>>>>>> The project received an InfoWorld Bossie Award (the best open
>>>>>> source big data tools) in 2014.
>>>>>>
>>>>>> Major past contributions by external contributors include Apache Pig
>>>>>> support from Daniel Dai (Hortonworks), and the Apache Spark port and
>>>>>> an integration with Apache YARN from Takeshi Yamamuro (NTT). Hivemall
>>>>>> was originally designed for Apache Hive but it now supports Apache
>>>>>> Spark and Apache Pig.
>>>>>>
>>>>>> == Rationale ==
>>>>>>
>>>>>> User-defined functions are a powerful mechanism to enrich the
>>>>>> expressive power of declarative query languages like SQL, HiveQL,
>>>>>> Pig Latin, and Spark SQL. The Hive UDF interface is becoming the
>>>>>> de facto standard for SQL-on-Hadoop platforms; Apache Spark and
>>>>>> Apache Pig have full support for Hive UDFs/UDAFs/UDTFs, and Apache
>>>>>> Impala, Apache Drill, and Apache Tajo also have limited support for
>>>>>> Hive UDFs/UDAFs.
>>>>>>
>>>>>> Hivemall can be considered a cross-platform library for machine
>>>>>> learning because it is implemented as cross-platform Hive
>>>>>> UDFs/UDAFs/UDTFs; prediction models built by a batch query of Apache
>>>>>> Hive can be used on Apache Spark/Pig, and conversely, prediction
>>>>>> models built by Apache Spark can be used from Apache Hive/Pig.
>>>>>>
>>>>>> Several database vendors are trying to offer machine learning
>>>>>> functionality in relational databases, so that the costs of moving
>>>>>> data can be eliminated. Apache MADlib, a machine learning library for
>>>>>> HAWQ and PostgreSQL, has been accepted as an Apache Incubator project.
>>>>>> MADlib is implemented using the PostgreSQL UDF interface.
>>>>>>
>>>>>> Apache Hive has a JIRA ticket, HIVE-7940, to support machine learning
>>>>>> functionality, so we consider this proposal useful for the
>>>>>> community. We believe Hivemall is better as a project separate
>>>>>> from Apache Hive because 1) we also target other data processing
>>>>>> frameworks such as Apache Spark as runtimes for Hivemall,
>>>>>> and 2) the current codebase is large enough to be separated.
>>>>>> Separation of concerns is good for project governance (e.g., release
>>>>>> management). For example, Apache DataFu is a data mining and
>>>>>> statistics library for Apache Pig and a project separate from
>>>>>> Apache Pig.
>>>>>>
>>>>>> We consider that Hivemall would be in a similar position to Apache
>>>>>> DataFu, but there are large differences in features and target
>>>>>> runtimes. The target runtime of Apache DataFu is Apache Pig, whereas
>>>>>> Hivemall targets Apache Hive, Apache Spark, and Apache Pig.
>>>>>> Apache DataFu is closer to a statistics library and does not
>>>>>> support machine learning features such as classification and
>>>>>> regression, while Hivemall is a machine learning library that
>>>>>> supports them.
>>>>>>
>>>>>> == Initial Goals ==
>>>>>>
>>>>>> The initial goals are as follows:
>>>>>>  * Establish the project governance in the Apache way and broaden the
>>>>>> community
>>>>>>  * Improve documentation
>>>>>>  * Add more unit/scenario tests
>>>>>>  * Hand over code and copyrights
>>>>>>
>>>>>> == Current Status ==
>>>>>>
>>>>>> Hivemall has several ongoing work-in-progress (WIP) features.
>>>>>>
>>>>>> Running the parameter server (a kind of distributed key-value store)
>>>>>> as an Apache YARN application is a major item. Hivemall’s parameter
>>>>>> server is currently a standalone application. Running parameter
>>>>>> servers on Apache YARN would use Hadoop cluster resources efficiently
>>>>>> and make management of parameter servers easier.
>>>>>>
>>>>>> Another major WIP item is integrating XGBoost into Hivemall. It needs
>>>>>> more work and testing, e.g., supporting cross compilation of the
>>>>>> native JNI objects of XGBoost.
>>>>>>
>>>>>> === Meritocracy ===
>>>>>>
>>>>>> The project members understand the importance of letting motivated
>>>>>> individuals contribute to the project. Since Hivemall was initially
>>>>>> released in 2013, it has received contributions from 14 contributors.
>>>>>>
>>>>>> Our intent with this incubator proposal is to build a diverse developer
>>>>>> community following the Apache meritocracy model. We welcome external
>>>>>> contributions and plan to elect committers from those who contribute
>>>>>> significantly to the project.
>>>>>>
>>>>>> === Community ===
>>>>>>
>>>>>> While there are 15 contributors in total, there are 3-4 active
>>>>>> developers continuously involved in major feature development at
>>>>>> the moment. We hope to extend our contributor base and encourage
>>>>>> suggestions and contributions from any potential user.
>>>>>>
>>>>>> === Core Developers ===
>>>>>>
>>>>>> The current main developers are employees of Treasure Data, NTT,
>>>>>> and Hortonworks. Some of them are Hadoop/Pig PMC members and/or Hive
>>>>>> committers.
>>>>>>
>>>>>> === Alignment ===
>>>>>>
>>>>>> Incubating at the ASF is the natural choice for the Hivemall project
>>>>>> because Hivemall targets Apache Hive, Apache Spark, and Apache Pig
>>>>>> as runtimes. We encourage integrations with other ASF data
>>>>>> processing frameworks like Apache Impala and Apache Drill.
>>>>>>
>>>>>> == Known Risks ==
>>>>>>
>>>>>> The main developer's contributions are significant at the moment,
>>>>>> but this dependency should decrease as the community grows.
>>>>>>
>>>>>> === Orphaned products ===
>>>>>>
>>>>>> While the main developer is developing Hivemall as a full-time job at
>>>>>> Treasure Data, the company is well aware of the open source
>>>>>> philosophy and the importance of open governance of open source
>>>>>> products. Orphaning an ASF product can itself be considered a risk.
>>>>>> Hence, we think the risks of it being orphaned are minimal.
>>>>>>
>>>>>> === Inexperience with Open Source ===
>>>>>>
>>>>>> Hivemall has been developed as an open source project since 2013.
>>>>>> The majority of the project members have jobs developing open source
>>>>>> products and some of them are working on other ASF projects like
>>>>>> Apache Hadoop and Apache Pig. We thus consider that the project
>>>>>> members have enough experience in open source development.
>>>>>>
>>>>>> === Homogenous Developers ===
>>>>>>
>>>>>> The current list of committers consists of developers from three
>>>>>> different companies. The committers are geographically distributed
>>>>>> across the U.S. and Asia. They are experienced with working in a
>>>>>> distributed environment.
>>>>>>
>>>>>> While not included in the initial committers, there are other external
>>>>>> contributors to the project. So, we hope to establish a developer
>>>>>> community that includes those contributors from several other
>>>>>> corporations during the incubation process.
>>>>>>
>>>>>> === Reliance on Salaried Developers ===
>>>>>>
>>>>>> The major developer is paid by his employer to contribute to this
>>>>>> project and the other developers are paid by their employers for
>>>>>> Hadoop-related open source development. While they might change their
>>>>>> affiliations over time, they are willing to contribute their expertise
>>>>>> to open source development. So, the project would continue regardless
>>>>>> of their affiliations.
>>>>>>
>>>>>> === Relationships with Other Apache Products ===
>>>>>>
>>>>>> Hivemall is a collection of machine learning functions on Apache
>>>>>> Hive, Apache Spark, and Apache Pig. Apache MADlib is a collection of
>>>>>> machine learning functions for relational databases, i.e., Apache
>>>>>> HAWQ and PostgreSQL. There is no conflict in their target runtimes.
>>>>>>
>>>>>> === An Excessive Fascination with the Apache Brand ===
>>>>>>
>>>>>> Our interest in this incubation is attracting more contributors,
>>>>>> building a strong community with open governance, and increasing the
>>>>>> visibility of Hivemall in the market/community. We will be sensitive
>>>>>> to inadvertent abuse of the Apache brand for any commercial use and
>>>>>> will work with the Incubator PMC and project mentors to ensure the
>>>>>> brand policies are respected.
>>>>>>
>>>>>> == Documentation ==
>>>>>>
>>>>>> Information on Hivemall can be found at:
>>>>>> https://github.com/myui/hivemall/wiki
>>>>>>
>>>>>> == Initial Source ==
>>>>>>
>>>>>> We released the initial version of Hivemall in 2013 at
>>>>>> https://github.com/myui/hivemall and introduced Hivemall at the Hadoop
>>>>>> Summit 2014.
>>>>>>
>>>>>> == Source and Intellectual Property Submission Plan ==
>>>>>>
>>>>>> We know of no legal encumbrance to transferring the source to Apache.
>>>>>> We are going to obtain Contributor License Agreements (CLAs) covering
>>>>>> all property of Hivemall.
>>>>>>
>>>>>> Also, we plan to obtain a signed Software Grant Agreement (SGA)
>>>>>> from AIST.
>>>>>>
>>>>>> == External Dependencies ==
>>>>>>
>>>>>> Hivemall depends on the following third party libraries:
>>>>>>
>>>>>> Core module:
>>>>>>  * netty (The MIT License)
>>>>>>  * smile (Apache License v2.0)
>>>>>>  * org.tukaani.xz (Public Domain)
>>>>>>  * xgboost (Apache License v2.0)
>>>>>>  * hadoop (Apache License v2.0)
>>>>>>  * hive (Apache License v2.0)
>>>>>>  * log4j (Apache License v2.0)
>>>>>>  * guava (Apache License v2.0)
>>>>>>  * lucene-analyzers-kuromoji (Apache License v2.0)
>>>>>>  * junit (Eclipse Public License v1.0)
>>>>>>  * mockito (The MIT License)
>>>>>>  * powermock (Apache License v2.0)
>>>>>>  * kryo (BSD License)
>>>>>>
>>>>>> Hivemall on Spark:
>>>>>>  * spark (Apache License v2.0)
>>>>>>  * commons-cli  (Apache License v2.0)
>>>>>>  * commons-logging (Apache License v2.0)
>>>>>>  * commons-compress (Apache License v2.0)
>>>>>>  * scala-library (BSD License)
>>>>>>  * scalatest (Apache License v2.0)
>>>>>>  * xerial-core (Apache License v2.0)
>>>>>>
>>>>>> The dependencies all have Apache compatible licenses.
>>>>>>
>>>>>> == Cryptography ==
>>>>>>
>>>>>> N/A
>>>>>>
>>>>>> == Required resources ==
>>>>>>
>>>>>> === Mailing lists ===
>>>>>>
>>>>>>  * private@hivemall.incubator.apache.org  (with moderated
>>>>>> subscriptions)
>>>>>>  * commits@hivemall.incubator.apache.org
>>>>>>  * dev@hivemall.incubator.apache.org
>>>>>>  * user@hivemall.incubator.apache.org
>>>>>>
>>>>>> === Git Repository ===
>>>>>>
>>>>>> https://git-wip-us.apache.org/repos/asf/incubator-hivemall.git
>>>>>>
>>>>>> === JIRA assistance ===
>>>>>>
>>>>>> JIRA project Hivemall (HIVEMALL)
>>>>>>
>>>>>> == Initial Committers ==
>>>>>>
>>>>>>  * Makoto Yui (myui@treasure-data.com)
>>>>>>  * Takeshi Yamamuro (yamamuro.takshi@lab.ntt.co.jp)
>>>>>>  * Daniel Dai (daijy@hortonworks.com)
>>>>>>  * Tsuyoshi Ozawa (ozawa.tsuyoshi@lab.ntt.co.jp)
>>>>>>  * Kai Sasaki (sasaki@treasure-data.com)
>>>>>>
>>>>>> == Affiliations ==
>>>>>>
>>>>>> === Treasure Data ===
>>>>>>  * Makoto Yui
>>>>>>  * Kai Sasaki
>>>>>>
>>>>>> === NTT ===
>>>>>>  * Takeshi Yamamuro
>>>>>>  * Tsuyoshi Ozawa Apache Hadoop PMC member
>>>>>>
>>>>>> === Hortonworks ===
>>>>>>  * Daniel Dai (ASF member) Apache Pig PMC member
>>>>>>
>>>>>> == Sponsors ==
>>>>>>
>>>>>> === Champion ===
>>>>>>  * Roman Shaposhnik (Pivotal, ASF member, IPMC member) Apache
>>>>>> Bigtop/Incubator PMC member
>>>>>>
>>>>>> === Nominated Mentors ===
>>>>>>
>>>>>>  * Reynold Xin (Databricks, ASF member) Apache Spark PMC member
>>>>>>  * Markus Weimer (Microsoft, ASF member) Apache REEF PMC member
>>>>>>  * Xiangrui Meng (Databricks, ASF member) Apache Spark PMC member
>>>>>>
>>>>>> === Sponsoring Entity ===
>>>>>>
>>>>>> We are requesting the Incubator to sponsor this project.
>>>>>
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>>>>> For additional commands, e-mail: general-help@incubator.apache.org
>>>>>
>>>>
>>>>
>>>>
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbonofre@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>>
>>
>>
>>
>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>



-- 
Makoto YUI <myui AT treasure-data.com>
Research Engineer, Treasure Data, Inc.
http://myui.github.io/


