Date: Fri, 28 Jul 2017 17:17:00 +0000 (UTC)
From: "Fei Hu (JIRA)"
To: issues@systemml.apache.org
Reply-To: dev@systemml.apache.org
Subject: [jira] [Updated] (SYSTEMML-1760) Improve engine robustness of distributed SGD training

     [ https://issues.apache.org/jira/browse/SYSTEMML-1760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fei Hu updated SYSTEMML-1760:
-----------------------------
    Attachment: Runtime_Table.png

> Improve engine robustness of distributed SGD training
> -----------------------------------------------------
>
>                 Key: SYSTEMML-1760
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1760
>             Project: SystemML
>          Issue Type: Improvement
>          Components: Algorithms, Compiler, ParFor
>            Reporter: Mike Dusenberry
>            Assignee: Fei Hu
>         Attachments: Runtime_Table.png
>
>
> Currently, we have a mathematical framework in place for training with distributed SGD in a [distributed MNIST LeNet example | https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet_distrib_sgd.dml]. This task aims to push this to scale to determine (1) the current behavior of the engine (i.e., does the optimizer actually run this in a distributed fashion?), and (2) ways to improve the robustness and performance for this scenario.
> The distributed SGD framework from this example has already been ported into Caffe2DML, and thus improvements made for this task will directly benefit our efforts towards distributed training of Caffe models (and Keras in the future).

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)