incubator-general mailing list archives

From Thejas Nair <thejas.n...@gmail.com>
Subject Re: [DISCUSS] [PROPOSAL] Singa for Apache Incubator
Date Wed, 28 Jan 2015 18:27:16 GMT
Thanks for pointing that out, Henry! Yes, it looks like H2O is not an
Apache project; I should have verified that.
I will edit that, and revisit that section along with the folks in
the Singa community.


On Tue, Jan 27, 2015 at 6:55 PM, Henry Saputra <henry.saputra@gmail.com> wrote:
> Quick immediate comment that "Apache H2O" is not really an Apache project.
>
> I assume you are referring to https://github.com/h2oai/h2o (or
> https://github.com/h2oai/h2o-dev) ?
>
> - Henry
>
> On Tue, Jan 27, 2015 at 5:29 PM, Thejas Nair <thejas.nair@gmail.com> wrote:
>> Hello everyone,
>>
>> I would like to propose the inclusion of Singa as an Apache Incubator project.
>>
>> Here is the proposal - https://wiki.apache.org/incubator/SingaProposal
>>
>> Please review the proposal and give feedback. I am planning to start a
>> vote after 7 days if the proposal looks good.
>> We are also seeking additional Apache mentors for the project.
>>
>> Thanks,
>> Thejas
>> ==========================================================
>> Singa Incubator Proposal
>>
>> Abstract
>>
>> SINGA is a distributed deep learning platform.
>>
>> Proposal
>>
>> SINGA is an efficient, scalable and easy-to-use distributed platform
>> for training deep learning models, e.g., Deep Convolutional Neural
>> Network and Deep Belief Network. It parallelizes the computation
>> (i.e., training) onto a cluster of nodes by distributing the training
>> data and model automatically to speed up the training. Built-in
>> training algorithms like Back-Propagation and Contrastive Divergence
>> are implemented based on common abstractions of deep learning models.
>> Users can train their own deep learning models by simply customizing
>> these abstractions, much as one implements the Mapper and Reducer in Hadoop.
>>
>> Background
>>
>> Deep learning refers to a set of feature (or representation) learning
>> models that consist of multiple (non-linear) layers, where different
>> layers learn different levels of abstractions (representations) of the
>> raw input data. Larger (in terms of model parameters) and deeper (in
>> terms of number of layers) models have shown better performance, e.g.,
>> lower image classification error in the Large Scale Visual Recognition
>> Challenge. However, a larger model requires more memory and larger
>> training data to reduce over-fitting. Complex numeric operations make
>> the training computation-intensive. In practice, training large deep
>> learning models takes weeks or months on a single node (even with
>> a GPU).
>>
>> Rationale
>>
>> Deep learning has gained a lot of attention in both academia and
>> industry due to its success in a wide range of areas such as computer
>> vision and speech recognition. However, training of such models is
>> computationally expensive, especially for large and deep models (e.g.,
>> with billions of parameters and more than 10 layers). Both Google and
>> Microsoft have developed distributed deep learning systems to make the
>> training more efficient by distributing the computations within a
>> cluster of nodes. However, these systems are closed-source software.
>> Our goal is to leverage the community of open source developers to
>> make SINGA efficient, scalable and easy to use. SINGA is a
>> full-fledged distributed platform that could benefit the community
>> and, in turn, benefit from the community's contributions to further
>> work in this area. We believe the nature of SINGA and our vision for
>> the system fit naturally with Apache's philosophy and development
>> framework.
>>
>> Initial Goals
>>
>> We have developed a system for SINGA running on a commodity computer
>> cluster. The initial goals include:
>>
>> * improving the system in terms of scalability and efficiency, e.g.,
>>   using Infiniband for network communication and multi-threading for
>>   one node computation. We would consider extending SINGA to GPU
>>   clusters later.
>> * benchmarking with larger datasets (hundreds of millions of training
>>   instances) and models (billions of parameters).
>> * adding more built-in deep learning models. Users can train the
>>   built-in models on their datasets directly.
>>
>> Current Status
>>
>> Meritocracy
>>
>> We would like to follow ASF meritocratic principles to encourage more
>> developers to contribute to this project. We know that only active and
>> excellent developers can make SINGA a successful project. The
>> committer list and PMC will be updated based on developers'
>> performance and commitment. We are also improving the documentation
>> and code to help new developers get started quickly.
>>
>> Community
>>
>> SINGA is currently being developed in the Database System Research Lab
>> at the National University of Singapore (NUS) in collaboration with
>> Zhejiang University in China. Our lab has extensive experience in
>> building database related systems, including distributed systems. Six
>> PhD students and research assistants (Jinyang Gao, Kaiping Zheng,
>> Sheng Wang, Wei Wang, Zhaojing Luo and Zhongle Xie), a research
>> fellow (Anh Dinh) and three professors (Beng Chin Ooi, Gang Chen, Kian
>> Lee Tan) have been working for a year on this project. We are open to
>> recruiting more developers from diverse backgrounds.
>>
>> Core Developers
>>
>> Beng Chin Ooi, Gang Chen and Kian Lee Tan are professors who have
>> worked on distributed systems for more than 20 years. They have
>> collaborated with the industry and have built various large scale
>> systems. Anh Dinh's research is also on distributed systems, albeit
>> with more focus on security aspects. Wei Wang's research is on deep
>> learning problems including deep learning applications and large scale
>> training. Sheng Wang and Jinyang are working on efficient indexing and
>> querying of large scale data, and on machine learning. Kaiping, Zhaojing
>> and Zhongle are new PhD students who joined SINGA recently. They will
>> work on this project for a longer time (the next 4-5 years). While we
>> share common research interests, each member also brings diverse
>> expertise to the team.
>>
>> Alignment
>>
>> ASF is already the home of many distributed platforms, e.g., Hadoop,
>> Spark and Mahout, each of which targets a different application
>> domain. SINGA, being a distributed platform for large-scale deep
>> learning, focuses on another important domain, one that still lacks
>> a robust and scalable open-source platform. The recent success
>> of deep learning models, especially for vision and speech recognition
>> tasks, has generated interest both in applying existing deep learning
>> models and in developing new ones. Thus, an open-source platform for
>> deep learning will be able to attract a large community of users and
>> developers. SINGA is a complex system needing many iterations of
>> design, implementation and testing. Apache's collaboration framework,
>> which encourages active contribution from developers, will inevitably
>> help improve the quality of the system, as shown in the success of
>> Hadoop, Spark, etc. Equally important is the community of users, which
>> helps identify real-life applications of deep learning, and helps to
>> evaluate the system's performance and ease-of-use. We hope to leverage
>> ASF for coordinating and promoting both communities, and in return
>> benefit the communities with another useful tool.
>>
>> Known Risks
>>
>> Orphaned products
>>
>> Four core developers (Anh, Wei Wang, Jinyang and Sheng Wang) may leave
>> the lab in two to four years' time. It is possible that some of them
>> may not have enough time to focus on this project after that. But,
>> SINGA is part of our other bigger research projects on building an
>> infrastructure for data intensive applications, which include
>> health-care analytics and brain-inspired computing. Beng Chin and Kian
>> Lee would continue working on it and getting more people involved. For
>> example, three new developers (Kaiping, Zhaojing and Zhongle) joined
>> us recently. Individual developers are welcome to make SINGA a diverse
>> community that is robust and independent from any single developer.
>>
>> Inexperience with Open Source
>>
>> All the developers are active users and followers of open source
>> projects. Our research lab has a strong commitment to open source, and
>> has released the source code of several systems under open source
>> license as a way of contributing back to the open source community.
>> But we do not have much real experience in open source projects with
>> large and well organized communities like those in Apache. This is one
>> reason we choose Apache, which is experienced in open source project
>> incubation. We hope to get help from Apache (e.g., champion and
>> mentors) to establish a healthy path for SINGA.
>>
>> Homogenous Developers
>>
>> Although the current developers are researchers in the universities,
>> they have different research interests and project experiences, as
>> mentioned in the section that introduces the core developers. We know
>> that a diverse community is helpful. Hence we are open to the idea of
>> recruiting developers from other regions and organizations.
>>
>> Reliance on Salaried Developers
>>
>> As a research project in the university, SINGA's current developer
>> community consists of professors, PhD students, research assistants
>> and postdoctoral fellows. They are driven by their interests to work
>> on this project and have contributed actively since the start of the
>> project. The research assistants and fellows are expected to leave
>> when their contracts expire. However, they are keen to continue to
>> work on the project voluntarily. Moreover, as a long term research
>> project, new research assistants and fellows are likely to join the
>> project.
>>
>> An Excessive Fascination with the Apache Brand
>>
>> We did not choose Apache for publicity. We have two purposes. First, we
>> want to leverage Apache's reputation to recruit more developers to
>> make a diverse community. Second, we hope that Apache can help us to
>> establish a healthy path in developing SINGA. Beng Chin and Kian-Lee
>> are established database and distributed system researchers, and
>> together with the other contributors, they sincerely believe that
>> there is a need for a widely accepted open source distributed deep
>> learning platform. The field of deep learning is still in its infancy,
>> and an open source platform will fuel the research in the area.
>> Moreover, such a platform will enable researchers to develop new
>> models and algorithms, rather than spending time implementing a deep
>> learning system from scratch. Furthermore, the need for scalability
>> for such a platform is obvious.
>>
>> Relationship with Other Apache Products
>>
>> Apache H2O implemented two simple deep learning models, namely the
>> Multi-Layer Perceptron and Deep Auto-encoders. There are two
>> significant differences between H2O and SINGA. First, H2O adopts the
>> Map-Reduce framework which runs a set of computing nodes in parallel
>> against the training set. Model parameters trained by all
>> computing nodes are averaged as the final model parameters. This
>> training algorithm is different from the distributed training
>> algorithm used by DistBelief, Adam and SINGA, which frequently
>> synchronizes the parameters trained from different nodes. SINGA adopts
>> the parameter server framework to support a wide range of distributed
>> training algorithms and parallelization methods (e.g., data
>> parallelism, model parallelism and hybrid parallelism; H2O only
>> supports data parallelism). Second, in H2O, users are restricted to
>> using the two built-in models. In SINGA, we provide a simple programming
>> model to let users implement their own deep learning models. A new
>> deep learning model can be implemented by customizing the base Layer
>> class for each layer involved in the model. It is similar to writing
>> Hadoop programs where users only need to override the base Mapper and
>> Reducer. We also provide built-in models for users to use directly.
>>
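
To make the contrast in the paragraph above concrete, here is a minimal single-process sketch in Python. It is not code from H2O or SINGA: every name in it (gradient, train_by_averaging, ParameterServer, train_with_server) is hypothetical, the model is a toy one-variable linear regression, and the "workers" simply take turns in a loop. It only illustrates the two communication patterns: training independently and averaging once at the end, versus pulling and pushing against shared parameters at every step.

```python
import random

def gradient(params, batch):
    """Mean-squared-error gradient of y = w*x + b over a batch."""
    w, b = params
    n = len(batch)
    gw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2 * (w * x + b - y) for x, y in batch) / n
    return gw, gb

def sgd_step(params, grad, lr):
    """One gradient-descent update on (w, b)."""
    return (params[0] - lr * grad[0], params[1] - lr * grad[1])

def train_by_averaging(shards, steps, lr):
    """Each worker trains alone on its shard; parameters are averaged once at the end."""
    results = []
    for shard in shards:
        params = (0.0, 0.0)
        for _ in range(steps):
            params = sgd_step(params, gradient(params, shard), lr)
        results.append(params)
    k = len(results)
    return (sum(p[0] for p in results) / k, sum(p[1] for p in results) / k)

class ParameterServer:
    """Hypothetical server: holds the shared parameters for all workers."""
    def __init__(self, lr):
        self.params = (0.0, 0.0)
        self.lr = lr

    def pull(self):
        return self.params

    def push(self, grad):
        self.params = sgd_step(self.params, grad, self.lr)

def train_with_server(shards, steps, lr):
    """Workers synchronize through the server at every step."""
    server = ParameterServer(lr)
    for _ in range(steps):
        for shard in shards:  # workers take turns in this single-process simulation
            server.push(gradient(server.pull(), shard))
    return server.pull()

# Toy, noise-free data drawn from y = 2x + 1, split across four "workers".
random.seed(0)
data = [(x, 2.0 * x + 1.0) for x in (random.uniform(-1, 1) for _ in range(200))]
shards = [data[i::4] for i in range(4)]

averaged = train_by_averaging(shards, steps=500, lr=0.1)
synced = train_with_server(shards, steps=500, lr=0.1)
```

Because the toy data is noise-free, both schemes end up near w = 2, b = 1 here. The sketch is about where the synchronization happens, not about which scheme gives a better model.
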
>> Documentation
>>
>> The project is hosted at
>> http://www.comp.nus.edu.sg/~dbsystem/project/singa.html.
>> Documentation can be found on the GitHub wiki page:
>> https://github.com/nusinga/singa/wiki. We continue to refine and
>> improve the documentation.
>>
>> Initial Source
>>
>> We use GitHub to maintain our source code: https://github.com/nusinga/singa
>>
>> Source and Intellectual Property Submission Plan
>>
>> We plan to release our code base under the Apache License, Version 2.0.
>>
>> External Dependencies
>>
>> Required by the core code base: glog, gflags, google protobuf,
>> open-blas, mpich, armci-mpi.
>> Required by data preparation and preprocessing: opencv, hdfs, python.
>>
>> Cryptography
>>
>> Not Applicable
>>
>> Required Resources
>>
>> Mailing Lists
>>
>> Currently, we use a Google group for internal discussion. The mailing
>> address is nusinga@googlegroup.com. We will migrate the content to the
>> Apache mailing lists in the future.
>>
>> singa-dev
>> singa-user
>> singa-commits
>> singa-private (for private discussion within the PMC)
>>
>> Git Repository
>>
>> We want to continue using git for version control. Hence, a git repo
>> is required.
>>
>> Issue Tracking
>>
>> JIRA Singa (SINGA)
>>
>> Initial Committers
>>
>> Beng Chin Ooi (ooibc @comp.nus.edu.sg)
>> Kian Lee Tan (tankl @comp.nus.edu.sg)
>> Gang Chen (cg @zju.edu.cn)
>> Wei Wang (wangwei @comp.nus.edu.sg)
>> Dinh Tien Tuan Anh (dinhtta @comp.nus.edu.sg)
>> Jinyang Gao (jinyang.gao @comp.nus.edu.sg)
>> Sheng Wang (wangsh @comp.nus.edu.sg)
>> Kaiping Zheng (kaiping @comp.nus.edu.sg)
>> Zhaojing Luo (zhaojing @comp.nus.edu.sg)
>> Zhongle Xie (zhongle @comp.nus.edu.sg)
>>
>> Affiliations
>>
>> Beng Chin Ooi, National University of Singapore
>> Kian Lee Tan, National University of Singapore
>> Gang Chen, Zhejiang University
>> Wei Wang, National University of Singapore
>> Dinh Tien Tuan Anh, National University of Singapore
>> Jinyang Gao, National University of Singapore
>> Sheng Wang, National University of Singapore
>> Kaiping Zheng, National University of Singapore
>> Zhaojing Luo, National University of Singapore
>> Zhongle Xie, National University of Singapore
>>
>> Sponsors
>>
>> Champion
>>
>> Thejas Nair (thejas at apache.org) - Hortonworks
>>
>> Nominated Mentors
>>
>> Thejas Nair (thejas at apache.org) - Hortonworks
>> Alan Gates (gates at apache dot org) - Hortonworks
>> (Seeking more volunteers!)
>>
>> Sponsoring Entity
>>
>> We are requesting the Incubator to sponsor this project.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>> For additional commands, e-mail: general-help@incubator.apache.org
>>