Date: Fri, 1 Jan 2016 10:10:39 +0000 (UTC)
From: "wangwei (JIRA)"
To: dev@singa.incubator.apache.org
Subject: [jira] [Updated] (SINGA-122) Optimize memory space for Param sharing

     [ https://issues.apache.org/jira/browse/SINGA-122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wangwei updated SINGA-122:
--------------------------
    Description: 
In deep learning models, some layers would share one or more Param objects, e.g., the encoder and decoder in an auto-encoder model would share the weight matrix, and RNN units (layers) would share all their parameters. For parallel training with replicated layers, these layers also share parameters.

It is necessary to optimise the memory space of these shared parameters. Otherwise, we would have to allocate memory for both the parameter values and the gradients of every share, which consumes a lot of memory, especially for the RNN case and the parallel-training case (there could be more than 10 shares of one Param object).

To minimise the memory footprint, there are three sharing levels:
1. share memory space for CPU values only, e.g., when one Param object is replicated on different GPU cards;
2. share memory space for both CPU values and GPU values;
3. share memory space for both values and gradients.
In terms of memory footprint, 1 > 2 > 3.

For levels 1 and 2, the code for computing gradients (i.e., Layer::ComputeGradient) is transparent to parameter sharing. However, for level 3, the code must handle the gradient aggregation correctly; otherwise, the gradients computed for one share would be overwritten by the others.

We need to update both the Param class and the NeuralNet class. Generally, the NeuralNet class creates Param objects and determines the sharing level (together with the user configuration). Layer::ComputeGradient assigns or aggregates gradients based on the sharing level (e.g., a flag).

Details will be updated later.

  was:
In deep learning models, some layers would share one or more Param objects, e.g., the en-coder and de-coder in a auto-encoder model would share weight matrix, and RNN units (layers) would share their all parameters. For parallel training with replicated layers, these layers share parameters.

It is necessary to optimise the memory space of these shared parameters. Otherwise, we have to allocate both memory space for the parameter values and gradients for each share. It would consume a lot of memory, especially for the RNN model case and the parallel training case (there would be more than 10 shares for one Param object).

To minimise the memory footprint, there are three levels,
1. share memory space CPU values, e.g., one Param object is replicated on different GPU cards.
2. share memory space for both CPU values and GPU values
3. share memory space for both values and gradients.
In terms of memory footprint, 1 > 2 > 3.

for level 1 and 2, the code for computing gradients is transparent to parameter sharing. However, for case 3, the code must handle the gradient aggregation correctly. Otherwise, the gradients computed for one share would be overwritten by others.

We need to update both the Param class and NeuralNet class (to decide the sharing case for Param objects). Generally, the NeuralNet class creates Param objects and determine the sharing level (together with user configuration). Layer::ComputeGradient assign or aggregate gradients based on the sharing level (e.g., a flag).

Details will be updated later..
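
As an illustration only (this sketch is not part of the original issue; the names ShareLevel, Blob, Param::ShareFrom and Param::UpdateGrad are hypothetical and not the actual SINGA API), the following C++ fragment shows how a per-Param sharing flag could decide whether a share keeps a private gradient buffer (levels 1 and 2) or accumulates into a shared one (level 3):

{code:cpp}
// Sketch only: ShareLevel, Blob, Param::ShareFrom and Param::UpdateGrad are
// hypothetical names and do not correspond to the actual SINGA classes.
#include <cstddef>
#include <memory>
#include <vector>

// Placeholder for a tensor with a CPU copy and (optionally) a GPU copy.
struct Blob {
  std::vector<float> cpu_data;
  void* gpu_data = nullptr;  // device pointer; memory management omitted here
};

// The three sharing levels described above, from largest to smallest footprint.
enum class ShareLevel {
  kCpuValue,     // level 1: share the CPU values only
  kValue,        // level 2: share both CPU and GPU values
  kValueAndGrad  // level 3: share values and gradients
};

class Param {
 public:
  // Make this Param a share of 'owner' at the given sharing level.
  void ShareFrom(Param& owner, ShareLevel level) {
    level_ = level;
    data_ = owner.data_;  // values are shared at every level
    grad_ = (level == ShareLevel::kValueAndGrad)
                ? owner.grad_                // level 3: one shared gradient buffer
                : std::make_shared<Blob>();  // levels 1-2: private gradient buffer
  }

  // Called from Layer::ComputeGradient with the gradient of this share.
  // At level 3 every share accumulates into the shared buffer; a plain
  // assignment would overwrite the gradients computed by the other shares.
  // (Resetting the shared buffer once per iteration is omitted for brevity.)
  void UpdateGrad(const std::vector<float>& g) {
    auto& dst = grad_->cpu_data;
    if (dst.size() != g.size()) dst.assign(g.size(), 0.0f);
    if (level_ == ShareLevel::kValueAndGrad) {
      for (std::size_t i = 0; i < g.size(); ++i) dst[i] += g[i];  // aggregate
    } else {
      dst = g;  // assign; sharing is transparent at levels 1 and 2
    }
  }

 private:
  ShareLevel level_ = ShareLevel::kValue;
  std::shared_ptr<Blob> data_ = std::make_shared<Blob>();
  std::shared_ptr<Blob> grad_ = std::make_shared<Blob>();
};
{code}

Under a scheme like this, the NeuralNet class would pick the sharing level when it creates the Param objects (based on the user configuration and, e.g., on whether the shares live on the same device), and Layer::ComputeGradient would only need to call the level-aware update.
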
> Optimize memory space for Param sharing
> ---------------------------------------
>
>          Key: SINGA-122
>          URL: https://issues.apache.org/jira/browse/SINGA-122
>      Project: Singa
>   Issue Type: Improvement
>     Reporter: wangwei
>
> In deep learning models, some layers would share one or more Param objects, e.g., the encoder and decoder in an auto-encoder model would share the weight matrix, and RNN units (layers) would share all their parameters. For parallel training with replicated layers, these layers also share parameters.
>
> It is necessary to optimise the memory space of these shared parameters. Otherwise, we would have to allocate memory for both the parameter values and the gradients of every share, which consumes a lot of memory, especially for the RNN case and the parallel-training case (there could be more than 10 shares of one Param object).
>
> To minimise the memory footprint, there are three sharing levels:
> 1. share memory space for CPU values only, e.g., when one Param object is replicated on different GPU cards;
> 2. share memory space for both CPU values and GPU values;
> 3. share memory space for both values and gradients.
> In terms of memory footprint, 1 > 2 > 3.
>
> For levels 1 and 2, the code for computing gradients (i.e., Layer::ComputeGradient) is transparent to parameter sharing. However, for level 3, the code must handle the gradient aggregation correctly; otherwise, the gradients computed for one share would be overwritten by the others.
>
> We need to update both the Param class and the NeuralNet class. Generally, the NeuralNet class creates Param objects and determines the sharing level (together with the user configuration). Layer::ComputeGradient assigns or aggregates gradients based on the sharing level (e.g., a flag).
>
> Details will be updated later.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)