From: GitBox
To: dev@singa.apache.org
Subject: [GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
Message-ID: <156544637464.2722.12465973002007830614.gitbox@gitbox.apache.org>
Date: Sat, 10 Aug 2019 14:12:54 -0000

chrishkchris commented on a change in pull request #468: Distributted module
URL: https://github.com/apache/incubator-singa/pull/468#discussion_r311068821

##########
File path: src/api/config.i
##########

@@ -0,0 +1,33 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+
+
+// Pass in cmake configurations to swig
+#define USE_CUDA 1
+#define USE_CUDNN 1
+#define USE_OPENCL 0
+#define USE_PYTHON 1
+#define USE_MKLDNN 1
+#define USE_JAVA 0
+#define CUDNN_VERSION 7401
+
+// SINGA version
+#define SINGA_MAJOR_VERSION 1

Review comment:
   In addition to the above, I also ran an 8 * K80 multi-GPU training and evaluation test with the CIFAR-10 dataset on ResNet-50. It reduces the training loss from 3983.8 to 35.56 in 100 epochs, and brings the evaluation accuracy to 90.6% (maximum at epoch 90).
However, this does not include the synchronization of running mean and variance before the evaluation phase:
```
Epoch=0: 100%|██████████| 195/195 [06:06<00:00, 1.91s/it]Training loss = 3983.820557, training accuracy = 0.225260 Test accuracy = 0.347556
Epoch=1: 100%|██████████| 195/195 [06:17<00:00, 1.94s/it]Training loss = 2628.622070, training accuracy = 0.379768 Test accuracy = 0.437700
Epoch=2: 100%|██████████| 195/195 [06:12<00:00, 1.89s/it]Training loss = 2347.072266, training accuracy = 0.448558 Test accuracy = 0.459936
Epoch=3: 100%|██████████| 195/195 [06:13<00:00, 1.88s/it]Training loss = 2075.987305, training accuracy = 0.517348 Test accuracy = 0.548978
Epoch=4: 100%|██████████| 195/195 [06:19<00:00, 1.97s/it]Training loss = 1890.109985, training accuracy = 0.566847 Test accuracy = 0.594451
Epoch=5: 100%|██████████| 195/195 [06:13<00:00, 1.92s/it]Training loss = 1720.395142, training accuracy = 0.606911 Test accuracy = 0.633413
Epoch=6: 100%|██████████| 195/195 [06:10<00:00, 1.92s/it]Training loss = 1555.737549, training accuracy = 0.645753 Test accuracy = 0.659054
Epoch=7: 100%|██████████| 195/195 [06:14<00:00, 1.91s/it]Training loss = 1385.688477, training accuracy = 0.687220 Test accuracy = 0.709836
Epoch=8: 100%|██████████| 195/195 [06:20<00:00, 1.97s/it]Training loss = 1269.426270, training accuracy = 0.714523 Test accuracy = 0.735477
Epoch=9: 100%|██████████| 195/195 [06:15<00:00, 1.91s/it]Training loss = 1137.953979, training accuracy = 0.746054 Test accuracy = 0.745393
Epoch=10: 100%|██████████| 195/195 [06:11<00:00, 1.88s/it]Training loss = 1031.773071, training accuracy = 0.770353 Test accuracy = 0.750501
Epoch=11: 100%|██████████| 195/195 [06:10<00:00, 1.89s/it]Training loss = 956.600037, training accuracy = 0.788261 Test accuracy = 0.777744
Epoch=12: 100%|██████████| 195/195 [06:16<00:00, 1.92s/it]Training loss = 881.050171, training accuracy = 0.804167 Test accuracy = 0.793369
Epoch=13: 100%|██████████| 195/195 [06:16<00:00, 1.92s/it]Training loss = 828.298828, training accuracy = 0.818309 Test accuracy = 0.807692
Epoch=14: 100%|██████████| 195/195 [06:11<00:00, 1.90s/it]Training loss = 790.558838, training accuracy = 0.823918 Test accuracy = 0.795373
Epoch=15: 100%|██████████| 195/195 [06:13<00:00, 1.90s/it]Training loss = 740.679871, training accuracy = 0.833734 Test accuracy = 0.816707
Epoch=16: 100%|██████████| 195/195 [06:20<00:00, 1.95s/it]Training loss = 691.391479, training accuracy = 0.846855 Test accuracy = 0.818510
Epoch=17: 100%|██████████| 195/195 [06:16<00:00, 1.89s/it]Training loss = 657.708130, training accuracy = 0.853986 Test accuracy = 0.826122
Epoch=18: 100%|██████████| 195/195 [06:10<00:00, 1.88s/it]Training loss = 627.918579, training accuracy = 0.860216 Test accuracy = 0.844752
Epoch=19: 100%|██████████| 195/195 [06:13<00:00, 1.91s/it]Training loss = 592.768982, training accuracy = 0.869551 Test accuracy = 0.845653
Epoch=20: 100%|██████████| 195/195 [06:19<00:00, 1.97s/it]Training loss = 561.560608, training accuracy = 0.875060 Test accuracy = 0.835938
Epoch=21: 100%|██████████| 195/195 [06:15<00:00, 1.97s/it]Training loss = 533.083740, training accuracy = 0.881370 Test accuracy = 0.849860
Epoch=22: 100%|██████████| 195/195 [06:12<00:00, 1.91s/it]Training loss = 508.004578, training accuracy = 0.885056 Test accuracy = 0.833434
Epoch=23: 100%|██████████| 195/195 [06:12<00:00, 1.92s/it]Training loss = 477.516602, training accuracy = 0.892488 Test accuracy = 0.858474
Epoch=24: 100%|██████████| 195/195 [06:20<00:00, 1.96s/it]Training loss = 455.839996, training accuracy = 0.896595 Test accuracy = 0.867388
Epoch=25: 100%|██████████| 195/195 [06:16<00:00, 1.95s/it]Training loss = 434.568390, training accuracy = 0.904327 Test accuracy = 0.858774
Epoch=26: 100%|██████████| 195/195 [06:10<00:00, 1.87s/it]Training loss = 414.232391, training accuracy = 0.907071 Test accuracy = 0.833333
Epoch=27: 100%|██████████| 195/195 [06:13<00:00, 1.87s/it]Training loss = 400.625458, training accuracy = 0.909275 Test accuracy = 0.858974
Epoch=28: 100%|██████████| 195/195 [06:20<00:00, 1.95s/it]Training loss = 378.750885, training accuracy = 0.914443 Test accuracy = 0.865885
Epoch=29: 100%|██████████| 195/195 [06:14<00:00, 1.91s/it]Training loss = 369.449249, training accuracy = 0.917548 Test accuracy = 0.871394
Epoch=30: 100%|██████████| 195/195 [06:13<00:00, 1.93s/it]Training loss = 345.693939, training accuracy = 0.921935 Test accuracy = 0.868389
Epoch=31: 100%|██████████| 195/195 [06:13<00:00, 1.88s/it]Training loss = 333.472717, training accuracy = 0.924860 Test accuracy = 0.865885
Epoch=32: 100%|██████████| 195/195 [06:15<00:00, 1.97s/it]Training loss = 316.274231, training accuracy = 0.927244 Test accuracy = 0.867889
Epoch=33: 100%|██████████| 195/195 [06:15<00:00, 1.95s/it]Training loss = 300.943665, training accuracy = 0.931871 Test accuracy = 0.871194
Epoch=34: 100%|██████████| 195/195 [06:12<00:00, 1.93s/it]Training loss = 299.318787, training accuracy = 0.931270 Test accuracy = 0.876402
Epoch=35: 100%|██████████| 195/195 [06:10<00:00, 1.88s/it]Training loss = 285.711884, training accuracy = 0.935317 Test accuracy = 0.879207
Epoch=36: 100%|██████████| 195/195 [06:16<00:00, 1.98s/it]Training loss = 266.605042, training accuracy = 0.939844 Test accuracy = 0.882612
Epoch=37: 100%|██████████| 195/195 [06:15<00:00, 1.93s/it]Training loss = 253.637848, training accuracy = 0.943069 Test accuracy = 0.882111
Epoch=38: 100%|██████████| 195/195 [06:09<00:00, 1.92s/it]Training loss = 243.406281, training accuracy = 0.944832 Test accuracy = 0.888421
Epoch=39: 100%|██████████| 195/195 [06:11<00:00, 1.92s/it]Training loss = 236.608551, training accuracy = 0.945553 Test accuracy = 0.868089
Epoch=40: 100%|██████████| 195/195 [06:21<00:00, 1.93s/it]Training loss = 226.691986, training accuracy = 0.948798 Test accuracy = 0.874099
Epoch=41: 100%|██████████| 195/195 [06:15<00:00, 1.94s/it]Training loss = 210.119171, training accuracy = 0.952724 Test accuracy = 0.885517
Epoch=42: 100%|██████████| 195/195 [06:12<00:00, 1.92s/it]Training loss = 200.071671, training accuracy = 0.954687 Test accuracy = 0.872696
Epoch=43: 100%|██████████| 195/195 [06:13<00:00, 1.94s/it]Training loss = 201.704514, training accuracy = 0.954527 Test accuracy = 0.867788
Epoch=44: 100%|██████████| 195/195 [06:20<00:00, 1.95s/it]Training loss = 197.687622, training accuracy = 0.955469 Test accuracy = 0.868690
Epoch=45: 100%|██████████| 195/195 [06:15<00:00, 1.93s/it]Training loss = 176.998566, training accuracy = 0.959675 Test accuracy = 0.879307
Epoch=46: 100%|██████████| 195/195 [06:12<00:00, 1.94s/it]Training loss = 169.160126, training accuracy = 0.961478 Test accuracy = 0.879307
Epoch=47: 100%|██████████| 195/195 [06:13<00:00, 1.94s/it]Training loss = 166.751923, training accuracy = 0.961438 Test accuracy = 0.876202
Epoch=48: 100%|██████████| 195/195 [06:20<00:00, 1.94s/it]Training loss = 163.559586, training accuracy = 0.962460 Test accuracy = 0.886218
Epoch=49: 100%|██████████| 195/195 [06:14<00:00, 1.91s/it]Training loss = 157.634018, training accuracy = 0.964483 Test accuracy = 0.882812
Epoch=50: 100%|██████████| 195/195 [06:12<00:00, 1.90s/it]Training loss = 142.496307, training accuracy = 0.967869 Test accuracy = 0.886218
Epoch=51: 100%|██████████| 195/195 [06:09<00:00, 1.81s/it]Training loss = 140.872879, training accuracy = 0.968169 Test accuracy = 0.894131
Epoch=52: 100%|██████████| 195/195 [06:20<00:00, 1.99s/it]Training loss = 142.073883, training accuracy = 0.968189 Test accuracy = 0.889824
Epoch=53: 100%|██████████| 195/195 [06:16<00:00, 1.88s/it]Training loss = 138.559738, training accuracy = 0.968329 Test accuracy = 0.876903
Epoch=54: 100%|██████████| 195/195 [06:10<00:00, 1.92s/it]Training loss = 132.399109, training accuracy = 0.969752 Test accuracy = 0.890425
Epoch=55: 100%|██████████| 195/195 [06:11<00:00, 1.91s/it]Training loss = 123.129364, training accuracy = 0.971755 Test accuracy = 0.881711
Epoch=56: 100%|██████████| 195/195 [06:21<00:00, 1.93s/it]Training loss = 121.916557, training accuracy = 0.971995 Test accuracy = 0.894631
Epoch=57: 100%|██████████| 195/195 [06:14<00:00, 1.91s/it]Training loss = 111.385445, training accuracy = 0.974860 Test accuracy = 0.891426
Epoch=58: 100%|██████████| 195/195 [06:10<00:00, 1.87s/it]Training loss = 117.021904, training accuracy = 0.973938 Test accuracy = 0.886719
Epoch=59: 100%|██████████| 195/195 [06:11<00:00, 1.89s/it]Training loss = 100.442093, training accuracy = 0.977264 Test accuracy = 0.884215
Epoch=60: 100%|██████████| 195/195 [06:18<00:00, 1.92s/it]Training loss = 103.660690, training accuracy = 0.976342 Test accuracy = 0.890525
Epoch=61: 100%|██████████| 195/195 [06:15<00:00, 1.93s/it]Training loss = 106.059982, training accuracy = 0.975861 Test accuracy = 0.897236
Epoch=62: 100%|██████████| 195/195 [06:10<00:00, 1.89s/it]Training loss = 100.289398, training accuracy = 0.977604 Test accuracy = 0.887921
Epoch=63: 100%|██████████| 195/195 [06:12<00:00, 1.91s/it]Training loss = 93.661957, training accuracy = 0.978906 Test accuracy = 0.880108
Epoch=64: 100%|██████████| 195/195 [06:19<00:00, 1.92s/it]Training loss = 88.674843, training accuracy = 0.980048 Test accuracy = 0.886719
Epoch=65: 100%|██████████| 195/195 [06:15<00:00, 1.92s/it]Training loss = 88.595192, training accuracy = 0.980088 Test accuracy = 0.882111
Epoch=66: 100%|██████████| 195/195 [06:12<00:00, 1.93s/it]Training loss = 80.745857, training accuracy = 0.982272 Test accuracy = 0.894331
Epoch=67: 100%|██████████| 195/195 [06:12<00:00, 1.91s/it]Training loss = 79.769966, training accuracy = 0.982151 Test accuracy = 0.893530
Epoch=68: 100%|██████████| 195/195 [06:21<00:00, 1.97s/it]Training loss = 86.334030, training accuracy = 0.980369 Test accuracy = 0.883413
Epoch=69: 100%|██████████| 195/195 [06:14<00:00, 1.91s/it]Training loss = 82.313301, training accuracy = 0.982091 Test accuracy = 0.889423
Epoch=70: 100%|██████████| 195/195 [06:10<00:00, 1.89s/it]Training loss = 76.229935, training accuracy = 0.983373 Test accuracy = 0.870292
Epoch=71: 100%|██████████| 195/195 [06:12<00:00, 1.95s/it]Training loss = 71.863472, training accuracy = 0.983914 Test accuracy = 0.893930
Epoch=72: 100%|██████████| 195/195 [06:20<00:00, 1.94s/it]Training loss = 66.012581, training accuracy = 0.985156 Test accuracy = 0.898337
Epoch=73: 100%|██████████| 195/195 [06:15<00:00, 1.96s/it]Training loss = 61.428085, training accuracy = 0.986619 Test accuracy = 0.893029
Epoch=74: 100%|██████████| 195/195 [06:11<00:00, 1.90s/it]Training loss = 67.723068, training accuracy = 0.984976 Test accuracy = 0.898538
Epoch=75: 100%|██████████| 195/195 [06:13<00:00, 1.91s/it]Training loss = 65.637268, training accuracy = 0.985176 Test accuracy = 0.900741
Epoch=76: 100%|██████████| 195/195 [06:18<00:00, 1.97s/it]Training loss = 67.880424, training accuracy = 0.985036 Test accuracy = 0.897536
Epoch=77: 100%|██████████| 195/195 [06:16<00:00, 1.93s/it]Training loss = 61.967018, training accuracy = 0.986078 Test accuracy = 0.897436
Epoch=78: 100%|██████████| 195/195 [06:13<00:00, 1.93s/it]Training loss = 61.895309, training accuracy = 0.986058 Test accuracy = 0.898938
Epoch=79: 100%|██████████| 195/195 [06:13<00:00, 1.90s/it]Training loss = 61.111233, training accuracy = 0.985697 Test accuracy = 0.898738
Epoch=80: 100%|██████████| 195/195 [06:21<00:00, 1.97s/it]Training loss = 55.601448, training accuracy = 0.987099 Test accuracy = 0.899639
Epoch=81: 100%|██████████| 195/195 [06:13<00:00, 1.89s/it]Training loss = 57.219810, training accuracy = 0.987500 Test accuracy = 0.887720
Epoch=82: 100%|██████████| 195/195 [06:13<00:00, 1.92s/it]Training loss = 58.462112, training accuracy = 0.987240 Test accuracy = 0.894832
Epoch=83: 100%|██████████| 195/195 [06:11<00:00, 1.86s/it]Training loss = 55.885990, training accuracy = 0.987500 Test accuracy = 0.904647
Epoch=84: 100%|██████████| 195/195 [06:21<00:00, 2.00s/it]Training loss = 48.977169, training accuracy = 0.988982 Test accuracy = 0.870192
Epoch=85: 100%|██████████| 195/195 [06:15<00:00, 1.93s/it]Training loss = 47.429367, training accuracy = 0.989984 Test accuracy = 0.880208
Epoch=86: 100%|██████████| 195/195 [06:12<00:00, 1.88s/it]Training loss = 51.012726, training accuracy = 0.988401 Test accuracy = 0.890124
Epoch=87: 100%|██████████| 195/195 [06:14<00:00, 1.95s/it]Training loss = 49.567501, training accuracy = 0.988702 Test accuracy = 0.901042
Epoch=88: 100%|██████████| 195/195 [06:20<00:00, 1.96s/it]Training loss = 44.965919, training accuracy = 0.990124 Test accuracy = 0.890925
Epoch=89: 100%|██████████| 195/195 [06:17<00:00, 1.98s/it]Training loss = 52.335827, training accuracy = 0.988241 Test accuracy = 0.898438
Epoch=90: 100%|██████████| 195/195 [06:11<00:00, 1.90s/it]Training loss = 43.000404, training accuracy = 0.990204 Test accuracy = 0.906050
Epoch=91: 100%|██████████| 195/195 [06:12<00:00, 1.90s/it]Training loss = 44.402187, training accuracy = 0.990865 Test accuracy = 0.881010
Epoch=92: 100%|██████████| 195/195 [06:21<00:00, 1.93s/it]Training loss = 42.708675, training accuracy = 0.991026 Test accuracy = 0.898738
Epoch=93: 100%|██████████| 195/195 [06:14<00:00, 1.96s/it]Training loss = 40.271782, training accuracy = 0.991346 Test accuracy = 0.880809
Epoch=94: 100%|██████████| 195/195 [06:10<00:00, 1.88s/it]Training loss = 43.947540, training accuracy = 0.990224 Test accuracy = 0.897636
Epoch=95: 100%|██████████| 195/195 [06:12<00:00, 1.92s/it]Training loss = 39.025536, training accuracy = 0.991667 Test accuracy = 0.902143
Epoch=96: 100%|██████████| 195/195 [06:19<00:00, 1.98s/it]Training loss = 38.811058, training accuracy = 0.991526 Test accuracy = 0.902945
Epoch=97: 100%|██████████| 195/195 [06:15<00:00, 1.94s/it]Training loss = 44.107109, training accuracy = 0.990004 Test accuracy = 0.896034
Epoch=98: 100%|██████████| 195/195 [06:09<00:00, 1.91s/it]Training loss = 32.846859, training accuracy = 0.993109 Test accuracy = 0.898137
Epoch=99: 100%|██████████| 195/195 [06:13<00:00, 1.91s/it]Training loss = 35.559738, training accuracy = 0.992468 Test accuracy = 0.899639
```
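As a side note, here is a minimal sketch of how those BatchNorm running statistics could be averaged across ranks before evaluation. It assumes SINGA's `autograd.BatchNorm2d` stores them as `running_mean` and `running_var` tensors and reuses the `synchronize` helper defined in the training script below; a real ResNet would also need a recursive walk over its nested blocks rather than this flat attribute scan:

```python
# A sketch only, not part of this PR: average BatchNorm running statistics
# across all ranks before the evaluation phase, so every rank evaluates
# with the same statistics. `synchronize` (defined in the training script
# below) sums a tensor across ranks and divides by the world size.
from singa import autograd

def sync_bn_stats(module, dist_opt):
    for attr in vars(module).values():
        if isinstance(attr, autograd.BatchNorm2d):
            synchronize(attr.running_mean, dist_opt)
            synchronize(attr.running_var, dist_opt)
```

Calling something like `sync_bn_stats(model, sgd)` just before setting `autograd.training = False` would address the caveat above.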
The code used is as below:

```python
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#

try:
    import pickle
except ImportError:
    import cPickle as pickle

from singa import singa_wrap as singa
from singa import autograd
from singa import tensor
from singa import device
from singa import opt
import cv2
#from scipy import misc
import numpy as np
from tqdm import trange


def load_dataset(filepath):
    print('Loading data file %s' % filepath)
    with open(filepath, 'rb') as fd:
        try:
            cifar10 = pickle.load(fd, encoding='latin1')
        except TypeError:
            cifar10 = pickle.load(fd)
    image = cifar10['data'].astype(dtype=np.uint8)
    image = image.reshape((-1, 3, 32, 32))
    label = np.asarray(cifar10['labels'], dtype=np.uint8)
    label = label.reshape(label.size, 1)
    return image, label


def load_train_data(dir_path='cifar-10-batches-py', num_batches=5):
    labels = []
    batchsize = 10000
    images = np.empty((num_batches * batchsize, 3, 32, 32), dtype=np.uint8)
    for did in range(1, num_batches + 1):
        fname_train_data = dir_path + "/data_batch_{}".format(did)
        image, label = load_dataset(fname_train_data)
        images[(did - 1) * batchsize:did * batchsize] = image
        labels.extend(label)
    images = np.array(images, dtype=np.float32)
    labels = np.array(labels, dtype=np.int32)
    return images, labels


def load_test_data(dir_path='cifar-10-batches-py'):
    images, labels = load_dataset(dir_path + "/test_batch")
    return np.array(images, dtype=np.float32), np.array(labels, dtype=np.int32)


def normalize_for_resnet(train_x, test_x):
    mean = [0.4914, 0.4822, 0.4465]
    std = [0.2023, 0.1994, 0.2010]
    train_x /= 255
    test_x /= 255
    # normalize every channel
    for ch in range(0, 3):
        train_x[:, ch, :, :] -= mean[ch]
        train_x[:, ch, :, :] /= std[ch]
        test_x[:, ch, :, :] -= mean[ch]
        test_x[:, ch, :, :] /= std[ch]
    return train_x, test_x


def resize_dataset(x, IMG_SIZE):
    num_data = x.shape[0]
    dim = x.shape[1]
    X = np.zeros(shape=(num_data, dim, IMG_SIZE, IMG_SIZE), dtype=np.float32)
    for n in range(0, num_data):
        for d in range(0, dim):
            X[n, d, :, :] = cv2.resize(x[n, d, :, :],
                                       (IMG_SIZE, IMG_SIZE)).astype(np.float32)
    return X


def augmentation(x, batch_size):
    xpad = np.pad(x, [[0, 0], [0, 0], [4, 4], [4, 4]], 'symmetric')
    for data_num in range(0, batch_size):
        offset = np.random.randint(8, size=2)
        x[data_num, :, :, :] = xpad[data_num, :,
                                    offset[0]:offset[0] + 32,
                                    offset[1]:offset[1] + 32]
        if_flip = np.random.randint(2)
        if (if_flip):
            x[data_num, :, :, :] = x[data_num, :, :, ::-1]
    return x


def accuracy(pred, target):
    y = np.argmax(pred, axis=1)
    t = np.argmax(target, axis=1)
    a = y == t
    return np.array(a, "int").sum()


def to_categorical(y, num_classes):
    y = np.array(y, dtype="int")
    n = y.shape[0]
    categorical = np.zeros((n, num_classes))
    for i in range(0, n):
        categorical[i, y[i]] = 1
    categorical = categorical.astype(np.float32)
    return categorical


def data_partition(dataset_x, dataset_y, rank_in_global, world_size):
    data_per_rank = dataset_x.shape[0] // world_size
    idx_start = rank_in_global * data_per_rank
    idx_end = (rank_in_global + 1) * data_per_rank
    return dataset_x[idx_start:idx_end], dataset_y[idx_start:idx_end]


def synchronize(tensor, dist_opt):
    singa.synch(tensor.data, dist_opt.communicator)
    # cannot use tensor /= dist_opt.world_size because "/=" is not in place,
    # but "-=" is in place
    tensor -= (dist_opt.world_size - 1) * tensor / dist_opt.world_size


if __name__ == '__main__':

    sgd = opt.SGD(lr=0.04, momentum=0.9, weight_decay=1e-5)
    sgd = opt.DistOpt(sgd)

    # load dataset
    # need to download first with "python3 incubator-singa/examples/cifar10/download_data.py py"
    train_x, train_y = load_train_data()
    test_x, test_y = load_test_data()
    train_x, test_x = normalize_for_resnet(train_x, test_x)
    train_x, train_y = data_partition(train_x, train_y,
                                      sgd.rank_in_global, sgd.world_size)
    test_x, test_y = data_partition(test_x, test_y,
                                    sgd.rank_in_global, sgd.world_size)

    num_classes = 10
    from resnet import resnet50
    model = resnet50(num_classes=num_classes)

    print('Start initialization............')
    dev = device.create_cuda_gpu_on(sgd.rank_in_local)

    max_epoch = 100
    batch_size = 32
    IMG_SIZE = 224
    tx = tensor.Tensor((batch_size, 3, IMG_SIZE, IMG_SIZE), dev, tensor.float32)
    ty = tensor.Tensor((batch_size,), dev, tensor.int32)
    num_train_batch = train_x.shape[0] // batch_size
    num_test_batch = test_x.shape[0] // batch_size
    idx = np.arange(train_x.shape[0], dtype=np.int32)
    reducer = tensor.Tensor((1,), dev, tensor.float32)

    # allreduce the initial parameters so all ranks start from the same model
    autograd.training = True
    #x = np.zeros(shape=[batch_size, 3, IMG_SIZE, IMG_SIZE], dtype=np.float32)
    #y = np.zeros(shape=[batch_size], dtype=np.int32)
    x = np.random.randn(batch_size, 3, IMG_SIZE, IMG_SIZE).astype(np.float32)
    y = np.random.randint(0, num_classes, batch_size, dtype=np.int32)
    tx.copy_from_numpy(x)
    ty.copy_from_numpy(y)
    out = model(tx)
    loss = autograd.softmax_cross_entropy(out, ty)
    for p, g in autograd.backward(loss):
        synchronize(p, sgd)

    for epoch in range(max_epoch):
        np.random.shuffle(idx)

        # Training phase
        autograd.training = True
        train_correct = np.zeros(shape=[1], dtype=np.float32)
        test_correct = np.zeros(shape=[1], dtype=np.float32)
        train_loss = np.zeros(shape=[1], dtype=np.float32)
        with trange(num_train_batch) as t:
            t.set_description('Epoch={}'.format(epoch))
            for b in t:
                x = train_x[idx[b * batch_size:(b + 1) * batch_size]]
                x = augmentation(x, batch_size)
                x = resize_dataset(x, IMG_SIZE)
                y = train_y[idx[b * batch_size:(b + 1) * batch_size]]
                tx.copy_from_numpy(x)
                ty.copy_from_numpy(y)
                out = model(tx)
                loss = autograd.softmax_cross_entropy(out, ty)
                train_correct += accuracy(tensor.to_numpy(out),
                                          to_categorical(y, num_classes)).astype(np.float32)
                train_loss += tensor.to_numpy(loss)[0]
                for p, g in autograd.backward(loss):
                    sgd.update(p, g)

        #print("rank"+str(sgd.rank_in_global)+": Acc="+str(train_correct)+". Loss="+str(train_loss), flush=True)
        #print("world size="+str(sgd.world_size), flush=True)

        # reduce the accuracy and loss from the different ranks
        reducer.copy_from_numpy(train_correct)
        reducer = sgd.all_reduce(reducer)
        train_correct = tensor.to_numpy(reducer)
        reducer.copy_from_numpy(train_loss)
        reducer = sgd.all_reduce(reducer)
        train_loss = tensor.to_numpy(reducer) * sgd.world_size
        #if(sgd.rank_in_global==0):
        #    print('Training loss = %f, Acc count = %f' % (train_loss, train_correct), flush=True)
        if sgd.rank_in_global == 0:
            print('Training loss = %f, training accuracy = %f' %
                  (train_loss, train_correct / (num_train_batch * batch_size)),
                  flush=True)

        # Evaluation phase
        autograd.training = False
        for b in range(num_test_batch):
            x = test_x[b * batch_size:(b + 1) * batch_size]
            x = resize_dataset(x, IMG_SIZE)
            y = test_y[b * batch_size:(b + 1) * batch_size]
            tx.copy_from_numpy(x)
            ty.copy_from_numpy(y)
            out_test = model(tx)
            test_correct += accuracy(tensor.to_numpy(out_test),
                                     to_categorical(y, num_classes))
        reducer.copy_from_numpy(test_correct)
        reducer = sgd.all_reduce(reducer)
        test_correct = tensor.to_numpy(reducer)
        if sgd.rank_in_global == 0:
            print('Test accuracy = %f' %
                  (test_correct / (num_test_batch * batch_size)), flush=True)
```