incubator-general mailing list archives

From Thejas Nair <thejas.n...@gmail.com>
Subject Re: [Fwd: Re: [DISCUSS] [PROPOSAL] Singa for Apache Incubator]
Date Thu, 26 Feb 2015 17:12:47 GMT
The incubator proposal has been updated with the feedback so far.
We have 3 mentors now, but I think it would be good to have additional
mentors. Please let me know if anyone is able to help mentor this
project.

I am planning to start a vote on the proposal in a day or two.


On Fri, Feb 6, 2015 at 5:21 PM,  <ooibc@comp.nus.edu.sg> wrote:
>
> Regarding the number of users of this project -- at the moment, the
> community is not big.  A few local start-ups have been trying to use it
> (mainly due to an announcement on our seminar list), e.g. one is using it
> for image recognition (given a photo snapped by a user, such as a luxury
> bag on a passerby, it wants to return the same product and a list of
> similar products).  Researchers from outside of NUS may have been using it
> since we published an application paper on cross domain/modal retrieval in
> VLDB 2014.
>
> We have not announced the project to the outside community yet -- we would
> announce it in dbworld etc in due course.
>
> Thanks and have a good weekend.
>
> regards
> beng chin
>
>>
>> Thanks for the comments and suggestions.
>> With permission from Thejas, I would like to respond to point 2.
>>
>> We have a huge team down at NUS (National University of Singapore) --
>> we have about seven database/data mining professors (not including
>> those in systems, networking, and machine learning).
>> I myself have nine PhD students in a steady state, and I have a few large
>> grants, with a total budget of about 15 million S$ (~12 million USD), that
>> allows me to hire a number of research fellows and research assistants for
>> the next few years.  In a constant state, I have about 20 people (PhD
>> students/RA/RF) working with me alone.  Other professors have their own
>> grants (unlike other countries, it is relatively easy to get large grants
>> in Singapore; many overseas Universities, including UIUC, MIT, ETH etc
>> have research labs funded by Singapore Research Foundation [equivalent of
>> NSF]).
>>
>> SINGA is a long term project for us -- while it is a platform as it is, we
>> are using it for healthcare predictive analytics (by working with a
>> hospital associated with the University).  Therefore, we will be working
>> on SINGA, not solely as a distributed DL platform, but as a tool that will
>> enable us to do data analytics in some business domains (e.g. healthcare,
>> consumer etc.)
>>
>> For the initial set of committers, three are tenured professors, five are
>> students, with 2-5 years to go before they complete their PhD.  Quite
>> often, some would stay back as a research fellow for a couple of years
>> before they start looking for a job outside.  We will work with mentors
>> and new developers (from outside of NUS or Zhejiang University) in
>> enhancing the system.
>>
>> The project should survive in that sense.
>>
>> (I have an on-going project CIIDAA that has been around since 2008; it was
>> started as another project, epiC,  with a different grant, and then we
>> continue the development with a new grant for CIIDAA --
>> http://www.comp.nus.edu.sg/~ciidaa/
>> )
>>
>> Thanks.
>>
>> regards
>> beng chin
>> ps: i am not sure if my email will get through to the group.
>>
>>
>> ---------------------------- Original Message ----------------------------
>> Subject: Re: [DISCUSS] [PROPOSAL] Singa for Apache Incubator
>> From:    "Henry Saputra" <henry.saputra@gmail.com>
>> Date:    Thu, February 5, 2015 2:57 pm
>> To:      "general@incubator.apache.org" <general@incubator.apache.org>
>> Cc:      ooibc@comp.nus.edu.sg
>> --------------------------------------------------------------------------
>>
>> Several comments:
>> -) How many users are already using this project? I would recommend
>> dropping the request for a singa-user list at the beginning.
>> -) All the initial committers come from a university, and it seems like
>> some of them are already ready to leave the university. I am not too
>> sure this project will survive if all of the initial committers are
>> from the university as students.
>> -) Need to solicit more mentors if this project ever gets to the Apache
>> Incubator.
>>
>> - Henry
>>
>> On Tue, Feb 3, 2015 at 3:58 PM, Thejas Nair <thejas.nair@gmail.com> wrote:
>>> The "Relationship with Other Apache Products" section has been
>>> updated. The reference to H2O in that section has been removed, and
>>> other projects have been added.
>>>  Thanks for the feedback!
>>>
>>>
>>> On Wed, Jan 28, 2015 at 10:27 AM, Thejas Nair <thejas.nair@gmail.com> wrote:
>>>> Thanks for pointing that out Henry! Yes, looks like H2O is not an
>>>> Apache project, I should have verified that.
>>>> I will edit that, and revisit that section along with the folks in
>>>> Singa community.
>>>>
>>>>
>>>> On Tue, Jan 27, 2015 at 6:55 PM, Henry Saputra <henry.saputra@gmail.com> wrote:
>>>>> Quick immediate comment that "Apache H2O" is not really an Apache
>>>>> project.
>>>>>
>>>>> I assume you are referring to https://github.com/h2oai/h2o (or
>>>>> https://github.com/h2oai/h2o-dev) ?
>>>>>
>>>>> - Henry
>>>>>
>>>>> On Tue, Jan 27, 2015 at 5:29 PM, Thejas Nair <thejas.nair@gmail.com> wrote:
>>>>>> Hello everyone,
>>>>>>
>>>>>> I would like to propose the inclusion of Singa as an Apache
>>>>>> Incubator project.
>>>>>>
>>>>>> Here is the proposal -
>>>>>> https://wiki.apache.org/incubator/SingaProposal
>>>>>>
>>>>>> Please review the proposal and give feedback. I am planning to start
>>>>>> a vote after 7 days if the proposal looks good.
>>>>>> We are also seeking additional Apache mentors for the project.
>>>>>>
>>>>>> Thanks,
>>>>>> Thejas
>>>>>> ==========================================================
>>>>>> Singa Incubator Proposal
>>>>>>
>>>>>> Abstract
>>>>>>
>>>>>> SINGA is a distributed deep learning platform.
>>>>>>
>>>>>> Proposal
>>>>>>
>>>>>> SINGA is an efficient, scalable and easy-to-use distributed platform
>>>>>> for training deep learning models, e.g., Deep Convolutional Neural
>>>>>> Network and Deep Belief Network. It parallelizes the computation
>>>>>> (i.e., training) onto a cluster of nodes by distributing the training
>>>>>> data and model automatically to speed up the training. Built-in
>>>>>> training algorithms like Back-Propagation and Contrastive Divergence
>>>>>> are implemented based on common abstractions of deep learning models.
>>>>>> Users can train their own deep learning models by simply customizing
>>>>>> these abstractions, much like implementing the Mapper and Reducer in
>>>>>> Hadoop.
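
The Mapper/Reducer analogy above can be illustrated with a minimal sketch. Every name in it (Layer, Scale, train_step, fit) is hypothetical and chosen only for illustration; none of them is taken from SINGA's code base, and the toy model has one scalar weight per layer. The point is only that a generic back-propagation loop can drive any user-defined layer that fills in the base abstraction.

```python
# Hypothetical sketch: these names are illustrative, not SINGA's API.
class Layer:
    """Base abstraction: subclasses supply the forward and backward passes."""

    def forward(self, x):
        raise NotImplementedError

    def backward(self, grad_out, lr):
        raise NotImplementedError


class Scale(Layer):
    """User-defined layer computing y = w * x with one trainable weight."""

    def __init__(self, w):
        self.w = w
        self.x = 0.0

    def forward(self, x):
        self.x = x  # remember the input for the backward pass
        return self.w * x

    def backward(self, grad_out, lr):
        grad_in = grad_out * self.w       # gradient w.r.t. the input
        self.w -= lr * grad_out * self.x  # gradient step on the weight
        return grad_in


def train_step(layers, x, target, lr=0.1):
    """Generic back-propagation over any list of Layer objects."""
    out = x
    for layer in layers:
        out = layer.forward(out)
    grad = 2.0 * (out - target)  # derivative of the squared error
    for layer in reversed(layers):
        grad = layer.backward(grad, lr)
    return (out - target) ** 2


def fit(layers, x, target, steps=50):
    """Run several training steps and return the loss seen at each step."""
    return [train_step(layers, x, target) for _ in range(steps)]
```

For example, fit([Scale(0.5), Scale(0.5)], 1.0, 2.0) trains a two-layer toy model; the training loop never needs to know which concrete layers it is driving.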
>>>>>>
>>>>>> Background
>>>>>>
>>>>>> Deep learning refers to a set of feature (or representation) learning
>>>>>> models that consist of multiple (non-linear) layers, where different
>>>>>> layers learn different levels of abstractions (representations) of
>>>>>> the raw input data. Larger (in terms of model parameters) and deeper
>>>>>> (in terms of number of layers) models have shown better performance,
>>>>>> e.g., lower image classification error in the Large Scale Visual
>>>>>> Recognition Challenge. However, a larger model requires more memory
>>>>>> and larger training data to reduce over-fitting. Complex numeric
>>>>>> operations make the training computationally intensive. In practice,
>>>>>> training large deep learning models takes weeks or months on a single
>>>>>> node (even with GPU).
>>>>>>
>>>>>> Rationale
>>>>>>
>>>>>> Deep learning has gained a lot of attention in both academia and
>>>>>> industry due to its success in a wide range of areas such as computer
>>>>>> vision and speech recognition. However, training of such models is
>>>>>> computationally expensive, especially for large and deep models
>>>>>> (e.g., with billions of parameters and more than 10 layers). Both
>>>>>> Google and Microsoft have developed distributed deep learning systems
>>>>>> to make the training more efficient by distributing the computations
>>>>>> within a cluster of nodes. However, these systems are closed-source
>>>>>> software. Our goal is to leverage the community of open source
>>>>>> developers to make SINGA efficient, scalable and easy to use. SINGA
>>>>>> is a full-fledged distributed platform that could benefit the
>>>>>> community and, in turn, benefit from the community's contributions to
>>>>>> further work in this area. We believe the nature of SINGA and our
>>>>>> vision for the system fit naturally with Apache's philosophy and
>>>>>> development framework.
>>>>>>
>>>>>> Initial Goals
>>>>>>
>>>>>> We have developed a system for SINGA running on a commodity computer
>>>>>> cluster. The initial goals include:
>>>>>>
>>>>>> * improving the system in terms of scalability and efficiency, e.g.,
>>>>>>   using Infiniband for network communication and multi-threading for
>>>>>>   one node computation. We would consider extending SINGA to GPU
>>>>>>   clusters later.
>>>>>> * benchmarking with larger datasets (hundreds of millions of training
>>>>>>   instances) and models (billions of parameters).
>>>>>> * adding more built-in deep learning models. Users can train the
>>>>>>   built-in models on their datasets directly.
>>>>>>
>>>>>> Current Status
>>>>>>
>>>>>> Meritocracy
>>>>>>
>>>>>> We would like to follow ASF meritocratic principles to encourage
>>>>>> more developers to contribute to this project. We know that only
>>>>>> active and excellent developers can make SINGA a successful project.
>>>>>> The committer list and PMC will be updated based on developers'
>>>>>> performance and commitment. We are also improving the documentation
>>>>>> and code to help new developers get started quickly.
>>>>>>
>>>>>> Community
>>>>>>
>>>>>> SINGA is currently being developed in the Database System Research
>>>>>> Lab at the National University of Singapore (NUS) in collaboration
>>>>>> with Zhejiang University in China. Our lab has extensive experience
>>>>>> in building database related systems, including distributed systems.
>>>>>> Six PhD students and research assistants (Jinyang Gao, Kaiping Zheng,
>>>>>> Sheng Wang, Wei Wang, Zhaojing Luo and Zhongle Xie), a research
>>>>>> fellow (Anh Dinh) and three professors (Beng Chin Ooi, Gang Chen,
>>>>>> Kian Lee Tan) have been working for a year on this project. We are
>>>>>> open to recruiting more developers from diverse backgrounds.
>>>>>>
>>>>>> Core Developers
>>>>>>
>>>>>> Beng Chin Ooi, Gang Chen and Kian Lee Tan are professors who have
>>>>>> worked on distributed systems for more than 20 years. They have
>>>>>> collaborated with the industry and have built various large scale
>>>>>> systems. Anh Dinh's research is also on distributed systems, albeit
>>>>>> with more focus on security aspects. Wei Wang's research is on deep
>>>>>> learning problems including deep learning applications and large
>>>>>> scale training. Sheng Wang and Jinyang are working on efficient
>>>>>> indexing, querying of large scale data and machine learning. Kaiping,
>>>>>> Zhaojing and Zhongle are new PhD students who joined SINGA recently.
>>>>>> They will work on this project for a longer time (next 4-5 years).
>>>>>> While we share common research interests, each member also brings
>>>>>> diverse expertise to the team.
>>>>>>
>>>>>> Alignment
>>>>>>
>>>>>> ASF is already the home of many distributed platforms, e.g., Hadoop,
>>>>>> Spark and Mahout, each of which targets a different application
>>>>>> domain. SINGA, being a distributed platform for large-scale deep
>>>>>> learning, focuses on another important domain which still lacks a
>>>>>> robust and scalable open-source platform. The recent success of deep
>>>>>> learning models, especially for vision and speech recognition tasks,
>>>>>> has generated interest in both applying existing deep learning models
>>>>>> and in developing new ones. Thus, an open-source platform for deep
>>>>>> learning will be able to attract a large community of users and
>>>>>> developers. SINGA is a complex system needing many iterations of
>>>>>> design, implementation and testing. Apache's collaboration framework,
>>>>>> which encourages active contribution from developers, will inevitably
>>>>>> help improve the quality of the system, as shown in the success of
>>>>>> Hadoop, Spark, etc. Equally important is the community of users,
>>>>>> which helps identify real-life applications of deep learning, and
>>>>>> helps to evaluate the system's performance and ease-of-use. We hope
>>>>>> to leverage ASF for coordinating and promoting both communities, and
>>>>>> in return benefit the communities with another useful tool.
>>>>>>
>>>>>> Known Risks
>>>>>>
>>>>>> Orphaned products
>>>>>>
>>>>>> Four core developers (Anh, Wei Wang, Jinyang and Sheng Wang) may
>>>>>> leave the lab in two to four years' time. It is possible that some of
>>>>>> them may not have enough time to focus on this project after that.
>>>>>> But SINGA is part of our other bigger research projects on building
>>>>>> an infrastructure for data intensive applications, which include
>>>>>> health-care analytics and brain-inspired computing. Beng Chin and
>>>>>> Kian Lee would continue working on it and getting more people
>>>>>> involved. For example, three new developers (Kaiping, Zhaojing and
>>>>>> Zhongle) joined us recently. Individual developers are welcome to
>>>>>> make SINGA a diverse community that is robust and independent from
>>>>>> any single developer.
>>>>>>
>>>>>> Inexperience with Open Source
>>>>>>
>>>>>> All the developers are active users and followers of open source
>>>>>> projects. Our research lab has a strong commitment to open source,
>>>>>> and has released the source code of several systems under open source
>>>>>> licenses as a way of contributing back to the open source community.
>>>>>> But we do not have much real experience in open source projects with
>>>>>> large and well organized communities like those in Apache. This is
>>>>>> one reason we chose Apache, which is experienced in open source
>>>>>> project incubation. We hope to get help from Apache (e.g., champion
>>>>>> and mentors) to establish a healthy path for SINGA.
>>>>>>
>>>>>> Homogenous Developers
>>>>>>
>>>>>> Although the current developers are researchers in the universities,
>>>>>> they have different research interests and project experiences, as
>>>>>> mentioned in the section that introduces the core developers. We
>>>>>> know that a diverse community is helpful. Hence we are open to the
>>>>>> idea of recruiting developers from other regions and organizations.
>>>>>>
>>>>>> Reliance on Salaried Developers
>>>>>>
>>>>>> As a research project in the university, SINGA's current developer
>>>>>> community consists of professors, PhD students, research assistants
>>>>>> and postdoctoral fellows. They are driven by their interests to work
>>>>>> on this project and have contributed actively since the start of the
>>>>>> project. The research assistants and fellows are expected to leave
>>>>>> when their contracts expire. However, they are keen to continue to
>>>>>> work on the project voluntarily. Moreover, as a long term research
>>>>>> project, new research assistants and fellows are likely to join the
>>>>>> project.
>>>>>>
>>>>>> An Excessive Fascination with the Apache Brand
>>>>>>
>>>>>> We did not choose Apache for publicity. We have two purposes. First,
>>>>>> we want to leverage Apache's reputation to recruit more developers to
>>>>>> make a diverse community. Second, we hope that Apache can help us to
>>>>>> establish a healthy path in developing SINGA. Beng Chin and Kian-Lee
>>>>>> are established database and distributed system researchers, and
>>>>>> together with the other contributors, they sincerely believe that
>>>>>> there is a need for a widely accepted open source distributed deep
>>>>>> learning platform. The field of deep learning is still in its
>>>>>> infancy, and an open source platform will fuel the research in the
>>>>>> area. Moreover, such a platform will enable researchers to develop
>>>>>> new models and algorithms, rather than spending time implementing a
>>>>>> deep learning system from scratch. Furthermore, the need for
>>>>>> scalability for such a platform is obvious.
>>>>>>
>>>>>> Relationship with Other Apache Products
>>>>>>
>>>>>> Apache H2O implemented two simple deep learning models, namely the
>>>>>> Multi-Layer Perceptron and Deep Auto-encoders. There are two
>>>>>> significant differences between H2O and SINGA. First, H2O adopts the
>>>>>> Map-Reduce framework, which runs a set of computing nodes in parallel
>>>>>> against partitions of the training set. Model parameters trained by
>>>>>> all computing nodes are averaged as the final model parameters. This
>>>>>> training algorithm is different from the distributed training
>>>>>> algorithm used by DistBelief, Adam and SINGA, which frequently
>>>>>> synchronizes the parameters trained on different nodes. SINGA adopts
>>>>>> the parameter server framework to support a wide range of distributed
>>>>>> training algorithms and parallelization methods (e.g., data
>>>>>> parallelism, model parallelism and hybrid parallelism), whereas H2O
>>>>>> supports only data parallelism. Second, in H2O, users are restricted
>>>>>> to using the two built-in models. In SINGA, we provide a simple
>>>>>> programming model to let users implement their own deep learning
>>>>>> models. A new deep learning model can be implemented by customizing
>>>>>> the base Layer class for each layer involved in the model. It is
>>>>>> similar to writing Hadoop programs where users only need to override
>>>>>> the base Mapper and Reducer. We also provide built-in models for
>>>>>> users to use directly.
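
The contrast drawn above, between averaging parameters once after independent training and synchronizing frequently through a parameter server, can be sketched on a toy problem. This is an illustrative sketch under stated assumptions, not code from H2O, DistBelief, Adam or SINGA: the model is a one-parameter regression y = w * x, the workers are simulated sequentially in one process, only data parallelism is shown, and all names (grad, train_by_averaging, ParameterServer, train_with_server) are hypothetical.

```python
# Hypothetical sketch of two data-parallel training schemes; not real
# H2O or SINGA code. Workers are simulated sequentially in one process.
def grad(w, shard):
    """Mean gradient of the squared error for the model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def train_by_averaging(shards, steps, lr):
    """Each worker trains alone on its shard; results are averaged once."""
    finals = []
    for shard in shards:
        w = 0.0
        for _ in range(steps):
            w -= lr * grad(w, shard)
        finals.append(w)
    return sum(finals) / len(finals)


class ParameterServer:
    """Holds the shared parameter; workers pull it and push gradients."""

    def __init__(self, w=0.0):
        self.w = w

    def pull(self):
        return self.w

    def push(self, g, lr):
        self.w -= lr * g


def train_with_server(shards, steps, lr):
    """Workers synchronize through the server at every step."""
    server = ParameterServer()
    for _ in range(steps):
        # Every worker computes its gradient on the same, current parameter.
        grads = [grad(server.pull(), shard) for shard in shards]
        server.push(sum(grads) / len(grads), lr)
    return server.w


# Toy data generated from y = 3 * x, split into two shards.
SHARDS = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
```

On this convex toy problem both train_by_averaging(SHARDS, 200, 0.01) and train_with_server(SHARDS, 200, 0.01) approach w = 3; the structural difference is that the second scheme keeps every worker on the same parameters throughout training, which matters for the non-convex models the proposal targets.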
>>>>>>
>>>>>> Documentation
>>>>>>
>>>>>> The project is hosted at
>>>>>> http://www.comp.nus.edu.sg/~dbsystem/project/singa.html.
>>>>>> Documentation can be found on the Github wiki page:
>>>>>> https://github.com/nusinga/singa/wiki. We continue to refine and
>>>>>> improve the documentation.
>>>>>>
>>>>>> Initial Source
>>>>>>
>>>>>> We use Github to maintain our source code:
>>>>>> https://github.com/nusinga/singa
>>>>>>
>>>>>> Source and Intellectual Property Submission Plan
>>>>>>
>>>>>> We plan to release our code base under the Apache License, Version 2.0.
>>>>>>
>>>>>> External Dependencies
>>>>>>
>>>>>> required by the core code base: glog, gflags, google protobuf,
>>>>>> open-blas, mpich, armci-mpi.
>>>>>> required by data preparation and preprocessing: opencv, hdfs, python.
>>>>>>
>>>>>> Cryptography
>>>>>>
>>>>>> Not Applicable
>>>>>>
>>>>>> Required Resources
>>>>>>
>>>>>> Mailing Lists
>>>>>>
>>>>>> Currently, we use google group for internal discussion. The mailing
>>>>>> address is nusinga@googlegroup.com. We will migrate the content to
>>>>>> the apache mailing lists in the future.
>>>>>>
>>>>>> singa-dev
>>>>>> singa-user
>>>>>> singa-commits
>>>>>> singa-private (for private discussion within PMC)
>>>>>>
>>>>>> Git Repository
>>>>>>
>>>>>> We want to continue using git for version control. Hence, a git repo
>>>>>> is required.
>>>>>>
>>>>>> Issue Tracking
>>>>>>
>>>>>> JIRA Singa (SINGA)
>>>>>>
>>>>>> Initial Committers
>>>>>>
>>>>>> Beng Chin Ooi (ooibc @comp.nus.edu.sg)
>>>>>> Kian Lee Tan (tankl @comp.nus.edu.sg)
>>>>>> Gang Chen (cg @zju.edu.cn)
>>>>>> Wei Wang (wangwei @comp.nus.edu.sg)
>>>>>> Dinh Tien Tuan Anh (dinhtta @comp.nus.edu.sg)
>>>>>> Jinyang Gao (jinyang.gao @comp.nus.edu.sg)
>>>>>> Sheng Wang (wangsh @comp.nus.edu.sg)
>>>>>> Kaiping Zheng (kaiping @comp.nus.edu.sg)
>>>>>> Zhaojing Luo (zhaojing @comp.nus.edu.sg)
>>>>>> Zhongle Xie (zhongle @comp.nus.edu.sg)
>>>>>>
>>>>>> Affiliations
>>>>>>
>>>>>> Beng Chin Ooi, National University of Singapore
>>>>>> Kian Lee Tan, National University of Singapore
>>>>>> Gang Chen, Zhejiang University
>>>>>> Wei Wang, National University of Singapore
>>>>>> Dinh Tien Tuan Anh, National University of Singapore
>>>>>> Jinyang Gao, National University of Singapore
>>>>>> Sheng Wang, National University of Singapore
>>>>>> Kaiping Zheng, National University of Singapore
>>>>>> Zhaojing Luo, National University of Singapore
>>>>>> Zhongle Xie, National University of Singapore
>>>>>>
>>>>>> Sponsors
>>>>>>
>>>>>> Champion
>>>>>>
>>>>>> Thejas Nair (thejas at apache.org) - Hortonworks
>>>>>>
>>>>>> Nominated Mentors
>>>>>>
>>>>>> Thejas Nair (thejas at apache.org) - Hortonworks
>>>>>> Alan Gates (gates at apache dot org) - Hortonworks
>>>>>> (Seeking more volunteers!)
>>>>>>
>>>>>> Sponsoring Entity
>>>>>>
>>>>>> We are requesting the Incubator to sponsor this project.
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>>>>>> For additional commands, e-mail: general-help@incubator.apache.org
>>>>>>
