horn-dev mailing list archives

From "Edward J. Yoon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HORN-27) Effective Parallel Training of Large Deep DropConnect Neural Networks
Date Wed, 27 Jul 2016 23:15:20 GMT

    [ https://issues.apache.org/jira/browse/HORN-27?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15396625#comment-15396625 ]

Edward J. Yoon commented on HORN-27:

I got some interesting results.
MNIST 1,000 iterations

1) Neural network
Accuracy: 0.8284

2) Dropout neural network (single task, batch size: 100, drop rate 0.5)
Accuracy: 0.9156

3) Parallel dropout neural network (20 tasks, batch size per task: 5, drop rate 0.5)
Accuracy: 0.9373

> Effective Parallel Training of Large Deep DropConnect Neural Networks
> ---------------------------------------------------------------------
>                 Key: HORN-27
>                 URL: https://issues.apache.org/jira/browse/HORN-27
>             Project: Apache Horn
>          Issue Type: Task
>            Reporter: Edward J. Yoon
> As you might already know, training large-scale deep ANN architectures, such as Convolutional
Neural Nets (CNNs) and Recurrent Neural Nets (RNNs), is challenging: the training process
has to be parallelized for large models, and it is also quite prone to overfitting
because of the large size of the network, even with large data sets. DropOut [1] and DropConnect
[2] are popular techniques for regularizing artificial neural networks: they randomly drop
hidden units (DropOut) or individual connections (DropConnect) during training.
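
The difference between the two techniques can be illustrated with a minimal sketch. It is not Horn code: the function names, the tiny layer, and the fixed seed are assumptions made for illustration, and the activation function and the usual rescaling of kept values are left out to keep it short.

```python
import random

def dropout_forward(x, weights, drop_rate, rng):
    """Dropout: drop whole hidden units with probability drop_rate."""
    outputs = []
    for row in weights:  # one row of incoming weights per hidden unit
        if rng.random() < drop_rate:
            outputs.append(0.0)  # the unit and all of its connections are dropped
        else:
            outputs.append(sum(w * xi for w, xi in zip(row, x)))
    return outputs

def dropconnect_forward(x, weights, drop_rate, rng):
    """DropConnect: drop individual connections with probability drop_rate."""
    outputs = []
    for row in weights:
        total = 0.0
        for w, xi in zip(row, x):
            if rng.random() >= drop_rate:  # keep this single connection
                total += w * xi
        outputs.append(total)
    return outputs

rng = random.Random(0)  # fixed seed (assumption) so runs are repeatable
x = [1.0, 2.0, 3.0]
weights = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
print(dropout_forward(x, weights, 0.5, rng))
print(dropconnect_forward(x, weights, 0.5, rng))
```

With DropOut a unit's output is either complete or zero, while with DropConnect each output is a partial sum over whichever connections survived.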

> In fact, this is exactly why we are doing this project. At the moment it is just a rough
idea: I am thinking about an ensemble concept combining DropOut and DropConnect that allows distributed
parallel training with small communication requirements. The core idea is to create
many model replicas on different subsets of the data, and to partition each network model
randomly across multiple processors, thus dropping connections and achieving locality of computation
at the same time.
> There have already been attempts to parallelize SGD-based training of large-scale deep
learning models on distributed systems. The basic concept is that each worker trains a copy
of the model, and the workers either combine their results synchronously or update through a
centralized parameter server asynchronously. For large models, layer-wise model parallelism
based on matrix operations is generally used. However, this leads to a large communication overhead between
host and device, or between hosts or devices (as in the image below).
> !https://4.bp.blogspot.com/-S6-akP8wGOE/V0eU9DrzESI/AAAAAAAAF-o/qAKZ08VgJDo9ZPJFHt1SXnfZ26yueBY2gCLcB/s640/modelparallel.png!
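
The synchronous variant described above can be sketched as follows. This is a toy illustration under stated assumptions, not an existing system: the one-parameter linear model, the squared-error loss, the two data shards, and the names `local_gradient` and `synchronous_step` are all hypothetical.

```python
def local_gradient(w, shard):
    """Mean gradient of 0.5 * (w*x - y)^2 over one worker's data shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def synchronous_step(w, shards, learning_rate):
    """One synchronous round: each worker computes a gradient, then all are averaged."""
    grads = [local_gradient(w, shard) for shard in shards]  # would run in parallel
    avg_grad = sum(grads) / len(grads)
    return w - learning_rate * avg_grad

# Toy data for y = 3x, split across two hypothetical workers.
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
shards = [data[0:2], data[2:4]]

w = 0.0
for _ in range(200):
    w = synchronous_step(w, shards, 0.05)
print(round(w, 3))
```

Every round requires all workers to exchange their gradients, which is the communication cost the paragraph above refers to.
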
> In contrast, my basic approach is as follows: we assign the training data and a model copy
to each of a number of worker groups. Then each group divides the large model irregularly into a few
disconnected sub-models of the parent model, so that the workers run independently of each
other. Finally, to get more randomness, we apply sub-model dropout.
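
One possible reading of the irregular split is sketched below, for a single layer. It is an assumption for illustration only, not the Horn implementation: units are assigned to workers uniformly at random, and a connection is kept only when both of its endpoints land on the same worker. The names `random_partition` and `partition_mask` are hypothetical.

```python
import random

def random_partition(num_units, num_workers, rng):
    """Assign each unit to a worker uniformly at random (an irregular split)."""
    return [rng.randrange(num_workers) for _ in range(num_units)]

def partition_mask(in_owner, out_owner):
    """Keep a connection only when both of its endpoints belong to the same worker."""
    return [[1 if o == i else 0 for i in in_owner] for o in out_owner]

rng = random.Random(42)  # fixed seed (assumption) so runs are repeatable
num_workers = 3
in_owner = random_partition(8, num_workers, rng)   # owners of 8 input units
out_owner = random_partition(6, num_workers, rng)  # owners of 6 output units
mask = partition_mask(in_owner, out_owner)

kept = sum(map(sum, mask))
print(f"kept {kept} of {len(mask) * len(mask[0])} connections")
```

Because every surviving connection stays inside one worker, the sub-models share no weights and need no communication during a forward or backward pass; the dropped cross-worker connections play the role of a DropConnect mask.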

