From: Jean-Baptiste Onofré
To: general@incubator.apache.org
Subject: Re: Call for Mentors
Date: Wed, 31 Aug 2016 08:24:03 +0200
Message-ID: <258ee111-edbd-6896-52bb-2d0fda4acf2d@nanthrax.net>

Hi Makoto,

It would have been with a lot of pleasure, but I'm already a mentor for several podlings.

Regards
JB

On 08/31/2016 06:30 AM, Makoto Yui wrote:
> As Roman mentioned, we welcome volunteering mentors.
>
> Please find our proposal in
> https://wiki.apache.org/incubator/HivemallProposal
>
> Thanks,
> Makoto
>
> 2016-08-31 11:28 GMT+09:00 Roman Shaposhnik:
>> Hi!
>>
>> It seems that the discussion has converged and I'd like to
>> make one extra call for volunteering mentors. Please let
>> me know ASAP since I'd like to get the VOTE going tomorrow.
>>
>> Thanks,
>> Roman.
>>
>> On Mon, Aug 22, 2016 at 10:20 AM, Roman Shaposhnik wrote:
>>> Hi!
>>>
>>> on behalf of the Hivemall team, I'd like to kick off
>>> a discussion thread around accepting Hivemall
>>> into the ASF Incubator.
>>>
>>> Hivemall is a library for machine learning implemented
>>> as Hive UDFs/UDAFs/UDTFs that runs on Hadoop-based
>>> data processing frameworks. More specifically, it currently
>>> runs on Apache Hive, Apache Spark, and Apache Pig, which
>>> support Hive UDFs as an extension mechanism.
>>>
>>> Here's the link to the proposal:
>>> https://wiki.apache.org/incubator/HivemallProposal
>>> and the full text is also attached to this email.
>>>
>>> Two of the areas that I'd like to explicitly solicit IPMC's opinion
>>> on are:
>>> 1. whether the process of re-licensing from LGPL to ALv2
>>> was enough given the ASF's strict IP policies
>>>
>>> 2. whether the 5 initial committers make sense given that
>>> there's a total of 15 contributors as per GitHub stats.
>>>
>>> With that, thanks, in advance, for your time and let the discussion begin!
>>>
>>> Thanks,
>>> Roman.
>>>
>>> == Abstract ==
>>>
>>> Hivemall is a library for machine learning implemented as Hive UDFs/UDAFs/UDTFs.
>>>
>>> Hivemall runs on Hadoop-based data processing frameworks, specifically
>>> on Apache Hive, Apache Spark, and Apache Pig, which support Hive UDFs
>>> as an extension mechanism.
>>>
>>> == Proposal ==
>>>
>>> Hivemall is a collection of machine learning algorithms and versatile
>>> data analytics functions. It provides a number of easy-to-use machine
>>> learning functionalities through user-defined functions (UDFs),
>>> user-defined aggregate functions (UDAFs), and/or user-defined table
>>> generating functions (UDTFs) of Apache Hive. It offers a variety of
>>> functionalities: regression, classification, recommendation, anomaly
>>> detection, k-nearest neighbor, and feature engineering.
>>> Hivemall supports state-of-the-art machine learning algorithms such as Soft
>>> Confidence Weighted, Adaptive Regularization of Weight Vectors,
>>> Factorization Machines, and AdaDelta. Hivemall is mainly designed to
>>> run on Apache Hive but it also supports Apache Pig and Apache Spark
>>> for the runtime.
>>>
>>> == Background ==
>>>
>>> Hivemall started as a research project of the main developer at
>>> National Institute of Advanced Industrial Science and Technology
>>> (AIST) in 2013 and the initial version was released on 2 Oct, 2013 on
>>> GitHub: https://github.com/myui/hivemall.
>>>
>>> After the main developer moved to Treasure Data in 2015, the project
>>> has been actively developed as an open source product and changed the
>>> license from GNU LGPL v2.1 to Apache License v2 on Mar 16, 2015. The
>>> project copyright holders agreed to change the license then.
>>>
>>> The community is growing incrementally and the project has 15
>>> contributors, 431 stars, and 131 forks on GitHub as of Aug 15, 2016.
>>> The project received an InfoWorld Bossie Award (the best open
>>> source big data tools) in 2014.
>>>
>>> Past main contributions by external contributors include Apache Pig
>>> support from Daniel Dai (Hortonworks), and the Apache Spark port and an
>>> integration with Apache YARN from Takeshi Yamamuro (NTT). Hivemall was
>>> originally designed for Apache Hive but it now supports Apache Spark
>>> and Apache Pig.
>>>
>>> == Rationale ==
>>>
>>> User-defined functions are a powerful mechanism to enrich the expressive
>>> power of declarative query languages like SQL, HiveQL, PigLatin, and Spark
>>> SQL. The Hive UDF interface is now becoming the de facto standard for
>>> SQL-on-Hadoop platforms; Apache Spark and Apache Pig have full
>>> support for Hive UDFs/UDAFs/UDTFs, and Apache Impala, Apache Drill,
>>> and Apache Tajo also have limited support for Hive UDFs/UDAFs.
>>>
>>> Hivemall can be considered a cross-platform library for machine
>>> learning as Hivemall is implemented as cross-platform Hive
>>> UDFs/UDAFs/UDTFs; prediction models built by a batch query of Apache
>>> Hive can be used on Apache Spark/Pig, and conversely, prediction
>>> models built by Apache Spark can be used from Apache Hive/Pig.
>>>
>>> Several database vendors are trying to offer machine learning
>>> functionality in relational databases, so that the costs of moving
>>> data can be eliminated. Apache MADlib, a machine learning library for
>>> HAWQ and PostgreSQL, is accepted as an Apache Incubator project.
>>> MADlib is implemented using the PostgreSQL UDF interface.
>>>
>>> Apache Hive has a JIRA ticket in HIVE-7940 to support machine learning
>>> functionalities. So, we consider this proposal useful for the
>>> community. We consider that Hivemall is better as a project separate
>>> from Apache Hive because 1) we target other data processing
>>> frameworks such as Apache Spark as well for the runtime of Hivemall,
>>> and 2) the current codebase is large enough to be separated.
>>> Separation of concerns is good for project governance (e.g., release
>>> management). For example, Apache DataFu is a data mining and statistics
>>> library for Apache Pig and a separate project from Apache Pig.
>>>
>>> We consider that Hivemall would be in a similar position to Apache DataFu
>>> but there are large differences in features and target runtimes.
>>> The target runtime of Apache DataFu is Apache Pig but Hivemall targets
>>> Apache Hive, Apache Spark, and Apache Pig for the target runtime.
>>> Apache DataFu is more of a statistics library and does not
>>> support machine learning features such as classification and
>>> regression but Hivemall is a machine learning library supporting them.
>>>
>>> == Initial Goals ==
>>>
>>> The initial goals are as follows:
>>> * Establish the project governance in the Apache way and broaden the community
>>> * Improve documentation
>>> * Add more unit/scenario tests
>>> * Hand over code and copyrights
>>>
>>> == Current Status ==
>>>
>>> Hivemall has several on-going WIP features.
>>>
>>> Making the parameter server (a kind of distributed key-value store) an
>>> Apache YARN application is a major issue. Hivemall’s parameter server
>>> is currently a standalone application. Running parameter servers on Apache
>>> YARN enables efficient use of Hadoop cluster resources and makes
>>> management of parameter servers easier.
>>>
>>> Another major WIP issue is integrating XGBoost into Hivemall. We need
>>> more work and tests, e.g., supporting cross compilation of native JNI
>>> objects of XGBoost.
>>>
>>> === Meritocracy ===
>>>
>>> The project members understand the importance of letting motivated
>>> individuals contribute to the project. Since Hivemall was initially
>>> released in 2014, it has received contributions from 14 contributors.
>>>
>>> Our intent with this incubator proposal is to build a diverse developer
>>> community following the Apache meritocracy model. We welcome external
>>> contributions and plan to elect committers from those who contribute
>>> significantly to the project.
>>>
>>> === Community ===
>>>
>>> While there are 15 contributors in total, there are 3-4 active
>>> developers continuously involved in major feature development at
>>> the moment. We hope to extend our contributor base and encourage
>>> suggestions and contributions from any potential user.
>>>
>>> === Core Developers ===
>>>
>>> The current main developers are employees of Treasure Data, NTT
>>> and Hortonworks. Some of them are Hadoop/Pig PMC members and/or Hive
>>> committers.
>>>
>>> === Alignment ===
>>>
>>> Incubating at ASF is the natural choice for the Hivemall project
>>> because Hivemall targets Apache Hive, Apache Spark,
>>> and Apache Pig as its runtimes. We encourage integrations with other ASF data
>>> processing frameworks like Apache Impala and Apache Drill.
>>>
>>> == Known Risks ==
>>>
>>> The contributions of the main developer are significant at the moment
>>> but this dependency would decrease as the community grows.
>>>
>>> === Orphaned products ===
>>>
>>> While the main developer is developing Hivemall as a full-time job at
>>> Treasure Data, the company is well aware of the open source
>>> philosophy and the importance of open governance of open source
>>> products. Orphaning an ASF product can itself be considered a risk.
>>> Hence, we think the risks of it being orphaned are minimal.
>>>
>>> === Inexperience with Open Source ===
>>>
>>> Hivemall has been developed as an open source project since 2013.
>>> The majority of the project members have jobs developing open source
>>> products and some of them are working on other ASF projects like
>>> Apache Hadoop and Apache Pig. We thus consider that the project
>>> members have enough experience in open source development.
>>>
>>> === Homogenous Developers ===
>>>
>>> The current list of committers consists of developers from three
>>> different companies. The committers are geographically distributed
>>> across the U.S. and Asia. They are experienced with working in a
>>> distributed environment.
>>>
>>> While not included in the initial committers, there are other external
>>> contributors to the project. So, we hope to establish a developer
>>> community that includes those contributors from several other
>>> corporations during the incubation process.
>>>
>>> === Reliance on Salaried Developers ===
>>>
>>> The major developer is paid by his employer to contribute to this
>>> project and the other developers are paid by their employers for
>>> Hadoop-related open source development. While they might change their
>>> affiliations over time, they are willing to apply their expertise to
>>> the open source development. So, the project would continue regardless
>>> of their affiliations.
>>>
>>> === Relationships with Other Apache Products ===
>>>
>>> Hivemall is a collection of machine learning functions on Apache
>>> Hive, Apache Spark, and Apache Pig. Apache MADlib is a collection of
>>> machine learning functions for relational databases, i.e., Apache HAWQ
>>> and PostgreSQL. There is no conflict in their target runtimes.
>>>
>>> === An Excessive Fascination with the Apache Brand ===
>>>
>>> Our interest in this incubation is attracting more contributors,
>>> building a strong community with open governance, and increasing the
>>> visibility of Hivemall in the market/community. We will be sensitive
>>> to inadvertent abuse of the Apache brand for any commercial use and
>>> will work with the Incubator PMC and project mentors to ensure the
>>> brand policies are respected.
>>>
>>> == Documentation ==
>>>
>>> Information on Hivemall can be found at:
>>> https://github.com/myui/hivemall/wiki
>>>
>>> == Initial Source ==
>>>
>>> We released the initial version of Hivemall in 2013 at
>>> https://github.com/myui/hivemall and introduced Hivemall at the Hadoop
>>> Summit 2014.
>>>
>>> == Source and Intellectual Property Submission Plan ==
>>>
>>> We know of no legal encumbrance to the transfer of the source to Apache. We
>>> are going to obtain Contributor License Agreements (CLAs) for all property
>>> of Hivemall.
>>>
>>> Also, we plan to have AIST sign a Software Grant Agreement (SGA).
>>>
>>> == External Dependencies ==
>>>
>>> Hivemall depends on the following third party libraries:
>>>
>>> Core module:
>>> * netty (The MIT License)
>>> * smile (Apache License v2.0)
>>> * org.tukaani.xz (Public Domain)
>>> * xgboost (Apache License v2.0)
>>> * hadoop (Apache License v2.0)
>>> * hive (Apache License v2.0)
>>> * log4j (Apache License v2.0)
>>> * guava (Apache License v2.0)
>>> * lucene-analyzers-kuromoji (Apache License v2.0)
>>> * junit (Eclipse Public License v1.0)
>>> * mockito (The MIT License)
>>> * powermock (Apache License v2.0)
>>> * kryo (BSD License)
>>>
>>> Hivemall on Spark:
>>> * spark (Apache License v2.0)
>>> * commons-cli (Apache License v2.0)
>>> * commons-logging (Apache License v2.0)
>>> * commons-compress (Apache License v2.0)
>>> * scala-library (BSD License)
>>> * scalatest (Apache License v2.0)
>>> * xerial-core (Apache License v2.0)
>>>
>>> The dependencies all have Apache compatible licenses.
>>>
>>> == Cryptography ==
>>>
>>> N/A
>>>
>>> == Required resources ==
>>>
>>> === Mailing lists ===
>>>
>>> * private@hivemall.incubator.apache.org (with moderated subscriptions)
>>> * commits@hivemall.incubator.apache.org
>>> * dev@hivemall.incubator.apache.org
>>> * user@hivemall.incubator.apache.org
>>>
>>> === Git Repository ===
>>>
>>> https://git-wip-us.apache.org/repos/asf/incubator-hivemall.git
>>>
>>> === JIRA assistance ===
>>>
>>> JIRA project Hivemall (HIVEMALL)
>>>
>>> == Initial Committers ==
>>>
>>> * Makoto Yui (myui@treasure-data.com)
>>> * Takeshi Yamamuro (yamamuro.takshi@lab.ntt.co.jp)
>>> * Daniel Dai (daijy@hortonworks.com)
>>> * Tsuyoshi Ozawa (ozawa.tsuyoshi@lab.ntt.co.jp)
>>> * Kai Sasaki (sasaki@treasure-data.com)
>>>
>>> == Affiliations ==
>>>
>>> === Treasure Data ===
>>> * Makoto Yui
>>> * Kai Sasaki
>>>
>>> === NTT ===
>>> * Takeshi Yamamuro
>>> * Tsuyoshi Ozawa Apache Hadoop PMC member
>>>
>>> === Hortonworks ===
>>> * Daniel Dai (ASF member) Apache Pig PMC member
>>>
>>> == Sponsors ==
>>>
>>> === Champion ===
>>> * Roman Shaposhnik (Pivotal, ASF member, IPMC member) Apache
>>> Bigtop/Incubator PMC member
>>>
>>> === Nominated Mentors ===
>>>
>>> * Reynold Xin (Databricks, ASF member) Apache Spark PMC member
>>> * Markus Weimer (Microsoft, ASF member) Apache REEF PMC member
>>> * Xiangrui Meng (Databricks, ASF member) Apache Spark PMC member
>>>
>>> === Sponsoring Entity ===
>>>
>>> We are requesting the Incubator to sponsor this project.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>> For additional commands, e-mail: general-help@incubator.apache.org
>>
>
>

--
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com