Date: Thu, 3 Aug 2017 17:00:09 +0000 (UTC)
From: "Mike Dusenberry (JIRA)"
To: issues@systemml.apache.org
Reply-To: dev@systemml.apache.org
Subject: [jira] [Commented] (SYSTEMML-1760) Improve engine robustness of distributed SGD training

    [ https://issues.apache.org/jira/browse/SYSTEMML-1760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113085#comment-16113085 ]

Mike Dusenberry commented on SYSTEMML-1760:
-------------------------------------------

[~Tenma] Awesome! That's a great speedup. Now that we've identified that the parfor optimizer is not choosing the optimal plan for this type of scenario, we can use these experiments to make improvements so that a naive usage of parfor yields a plan with the same performance (or better!).
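For context, a "naive usage of parfor" here means simply splitting the gradient work across independent parfor iterations and letting the optimizer pick the execution plan. Below is a minimal, self-contained DML sketch of that pattern; it is an illustrative assumption using a toy least-squares model, not the actual mnist_lenet_distrib_sgd.dml code, and the parallel_batches, batch_size, and learning rate values are made up for the example.

    # Minimal DML sketch (illustrative assumption, not the actual
    # mnist_lenet_distrib_sgd.dml script): one naive data-parallel SGD step
    # on a toy least-squares model, with the gradient work split across
    # parfor iterations so the parfor optimizer decides how to parallelize.
    parallel_batches = 4
    batch_size = 64
    N = parallel_batches * batch_size
    D = 10
    lr = 0.01
    X = rand(rows=N, cols=D)
    y = rand(rows=N, cols=1)
    W = matrix(0, rows=D, cols=1)

    grads = matrix(0, rows=parallel_batches, cols=D)  # one gradient row per worker

    parfor (j in 1:parallel_batches) {
      # Each iteration reads a disjoint slice of the data and writes a
      # disjoint row of the result matrix, so the iterations are independent.
      beg = (j - 1) * batch_size + 1
      end = j * batch_size
      X_j = X[beg:end, ]
      y_j = y[beg:end, ]
      grad_j = t(X_j) %*% (X_j %*% W - y_j) / batch_size  # squared-loss gradient
      grads[j, ] = t(grad_j)
    }

    # Average the per-worker gradients and apply one SGD update.
    W = W - lr * t(colSums(grads) / parallel_batches)

Whether the parfor body runs as local threads or as distributed Spark jobs is up to the parfor optimizer, which is exactly the plan choice being examined in these experiments.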
> Improve engine robustness of distributed SGD training
> -----------------------------------------------------
>
>                 Key: SYSTEMML-1760
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1760
>             Project: SystemML
>          Issue Type: Improvement
>          Components: Algorithms, Compiler, ParFor
>            Reporter: Mike Dusenberry
>            Assignee: Fei Hu
>         Attachments: Runtime_Table.png
>
>
> Currently, we have a mathematical framework in place for training with distributed SGD in a [distributed MNIST LeNet example | https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet_distrib_sgd.dml]. This task aims to push this at scale to determine (1) the current behavior of the engine (i.e., does the optimizer actually run this in a distributed fashion?), and (2) ways to improve the robustness and performance for this scenario. The distributed SGD framework from this example has already been ported into Caffe2DML, and thus improvements made for this task will directly benefit our efforts towards distributed training of Caffe models (and Keras models in the future).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)