From dev-return-7171-archive-asf-public=cust-asf.ponee.io@mxnet.incubator.apache.org  Fri Jan 10 23:15:58 2020
Return-Path: <dev-return-7171-archive-asf-public=cust-asf.ponee.io@mxnet.incubator.apache.org>
X-Original-To: archive-asf-public@cust-asf.ponee.io
Delivered-To: archive-asf-public@cust-asf.ponee.io
Received: from mail.apache.org (hermes.apache.org [207.244.88.153])
	by mx-eu-01.ponee.io (Postfix) with SMTP id 5344F180657
	for <archive-asf-public@cust-asf.ponee.io>; Sat, 11 Jan 2020 00:15:58 +0100 (CET)
Received: (qmail 17882 invoked by uid 500); 10 Jan 2020 23:15:57 -0000
Mailing-List: contact dev-help@mxnet.incubator.apache.org; run by ezmlm
Precedence: bulk
List-Help: <mailto:dev-help@mxnet.incubator.apache.org>
List-Unsubscribe: <mailto:dev-unsubscribe@mxnet.incubator.apache.org>
List-Post: <mailto:dev@mxnet.incubator.apache.org>
List-Id: <dev.mxnet.incubator.apache.org>
Reply-To: dev@mxnet.incubator.apache.org
Delivered-To: mailing list dev@mxnet.incubator.apache.org
Received: (qmail 17868 invoked by uid 99); 10 Jan 2020 23:15:56 -0000
Received: from pnap-us-west-generic-nat.apache.org (HELO spamd1-us-west.apache.org) (209.188.14.142)
    by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 10 Jan 2020 23:15:56 +0000
Received: from localhost (localhost [127.0.0.1])
	by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id 02BACC09DA
	for <dev@mxnet.apache.org>; Fri, 10 Jan 2020 23:15:56 +0000 (UTC)
X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org
X-Spam-Flag: NO
X-Spam-Score: -0.651
X-Spam-Level:
X-Spam-Status: No, score=-0.651 tagged_above=-999 required=6.31
	tests=[DKIMWL_WL_HIGH=-0.001, DKIM_SIGNED=0.1, DKIM_VALID=-0.1,
	DKIM_VALID_AU=-0.1, HEADER_FROM_DIFFERENT_DOMAINS=0.25,
	HTML_MESSAGE=0.2, MAILING_LIST_MULTI=-1, RCVD_IN_DNSWL_NONE=-0.0001,
	SPF_HELO_NONE=0.001, SPF_PASS=-0.001] autolearn=disabled
Authentication-Results: spamd1-us-west.apache.org (amavisd-new);
	dkim=pass (1024-bit key) header.d=github.com
Received: from mx1-ec2-va.apache.org ([10.40.0.8])
	by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024)
	with ESMTP id xFseLF3U10mh for <dev@mxnet.apache.org>;
	Fri, 10 Jan 2020 23:15:54 +0000 (UTC)
Received-SPF: Pass (mailfrom) identity=mailfrom; client-ip=209.85.167.49; helo=mail-lf1-f49.google.com; envelope-from=dmlc.notification+caf_=dev=mxnet.apache.org@gmail.com; receiver=<UNKNOWN> 
Received: from mail-lf1-f49.google.com (mail-lf1-f49.google.com [209.85.167.49])
	by mx1-ec2-va.apache.org (ASF Mail Server at mx1-ec2-va.apache.org) with ESMTPS id B165BBC564
	for <dev@mxnet.apache.org>; Fri, 10 Jan 2020 23:15:53 +0000 (UTC)
Received: by mail-lf1-f49.google.com with SMTP id n25so2733037lfl.0
        for <dev@mxnet.apache.org>; Fri, 10 Jan 2020 15:15:53 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:delivered-to:date:dkim-signature:from:reply-to
         :to:cc:message-id:subject:mime-version:content-transfer-encoding
         :precedence:list-id:list-archive:list-post:list-unsubscribe;
        bh=oAYXOHMOvp1LmDfm89EGE4BTrmDdyXM5zQGD9P/Zx1k=;
        b=PtItF1Jf0YSdqhOTkgZhSm3Kwgq2fx6VFR7wFcX7d3FwcHCK75M2Wu+0/KbYbLPi1G
         88EAKQT8ljkLNMLy59LVVYZQdJJqX4qtwGkEtRe7cEPLMDZvlnwJz2Q0XHmX3pwyPzzV
         2WbKdjV1YdKd9OIEUMxiPYqSVein79FzwBx9fhR4vYtXpcTKcfgdR2F3JVDsvdBcVxzs
         Gl49phyljzkGrJryIfW0CogBzOY5fjIcWPlpMwCQiTMeT0ql4ldbiuXsyCao/xKXV68y
         +MGNQrTrORaGTHZcv5jOsjLysHVGKjjfIUzmQWz2tiS20K0U9sKse4tKmYwBUq75xoxT
         STjA==
X-Gm-Message-State: APjAAAUKq+QWDiPtlY5PhbH3AUU/Ufl1SoMXuOhyce4zlwlMTx68kpi1
	5Bb9QG/QDBqZHje+kCURRFID1mplAGOMxvgKt5YyGutRtNjNwjs=
X-Received: by 2002:ac2:44ce:: with SMTP id d14mr3822585lfm.140.1578698152081;
        Fri, 10 Jan 2020 15:15:52 -0800 (PST)
X-Forwarded-To: dev@mxnet.apache.org
X-Forwarded-For: dmlc.notification@gmail.com dev@mxnet.apache.org
Delivered-To: dmlc.notification@gmail.com
Received: by 2002:ab3:6f55:0:0:0:0:0 with SMTP id k21csp1316099ltp;
        Fri, 10 Jan 2020 15:15:50 -0800 (PST)
X-Google-Smtp-Source: APXvYqw6Z+2OTcIKGsJEtjXao53mTot09eHxflEj7X5XqR+tAK7EvA0OhOzAuaT4LHWMYvpNHZMg
X-Received: by 2002:a0c:b5cd:: with SMTP id o13mr1094934qvf.47.1578698150241;
        Fri, 10 Jan 2020 15:15:50 -0800 (PST)
ARC-Seal: i=1; a=rsa-sha256; t=1578698150; cv=none;
        d=google.com; s=arc-20160816;
        b=EFjyeT2vfHkorvZJ38V1jo6dFS3bh3/ShU2iONZKeqNLJUy6zyXnJ+cMWc9YSCgzWe
         HpE/1CqhmeFlYL32YitaoSUFBI4hO7dml4EJSgnGPT/yXSozdzKY7TsgMxrAGfSmyXCq
         87dyk6QJvlYlmWQJu6PxQcrok1mmFC18F8Zhw4akW6XlpDNR74L0c60qJWq3CCuU5phC
         WiiRwdZhy+pVNiFGzR3PhdwPgjb0RE/JvaKpvzUiT11GXtFVdyyLJj80vD5HgZbx1pKc
         cQrI1kRzy7GRSxbIWmxJALcuGX4qRsPfTEjEm87oes+Emn0ggP1ngOmP37D6qcHqPfZW
         /P5g==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816;
        h=list-unsubscribe:list-post:list-archive:list-id:precedence
         :content-transfer-encoding:mime-version:subject:message-id:cc:to
         :reply-to:from:dkim-signature:date;
        bh=oAYXOHMOvp1LmDfm89EGE4BTrmDdyXM5zQGD9P/Zx1k=;
        b=chwhBHH5i06cCdq6r1dm6NK/il5qio/X+6pUH5IexqkhNCeVx0JM5OmJjfl82VYiKj
         9Msj4OXXsUfVYh+HVNPWBHu1Nokje28RYEFblGAsL7UB8FUPD7AOHsySGJROSJOoqq6q
         MEKj6qK8RM8dhk2i77ZjWyliCqokUQiEUMYVe8YljZDuYwqCW0rl7jCCdJfbcGOi2x0r
         40DEVSiyiK94ha9/vKyTLRBIYP2q4vjDvHdxyrRkg7Nawwa4dJVyUba1HkFWzsj7sMTd
         0ONR1TZlkIbihDtgSfQgPCz1Jcklb6LGxPpPBQNFQ7UFRhRB1DmkwRhpi0untAYlNeU6
         eKzQ==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=pass (test mode) header.i=@github.com header.s=pf2014 header.b=MiGF8Xvv;
       spf=pass (google.com: domain of noreply@github.com designates 192.30.252.201 as permitted sender) smtp.mailfrom=noreply@github.com;
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=github.com
Received: from out-18.smtp.github.com (out-18.smtp.github.com. [192.30.252.201])
        by mx.google.com with ESMTPS id d58si2161630qta.140.2020.01.10.15.15.49
        for <dmlc.notification@gmail.com>
        (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 10 Jan 2020 15:15:50 -0800 (PST)
Received-SPF: pass (google.com: domain of noreply@github.com designates 192.30.252.201 as permitted sender) client-ip=192.30.252.201;
Authentication-Results: mx.google.com;
       dkim=pass (test mode) header.i=@github.com header.s=pf2014 header.b=MiGF8Xvv;
       spf=pass (google.com: domain of noreply@github.com designates 192.30.252.201 as permitted sender) smtp.mailfrom=noreply@github.com;
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=github.com
Date: Fri, 10 Jan 2020 15:15:49 -0800
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=github.com;
	s=pf2014; t=1578698149;
	bh=oAYXOHMOvp1LmDfm89EGE4BTrmDdyXM5zQGD9P/Zx1k=;
	h=Date:From:Reply-To:To:Cc:Subject:List-ID:List-Archive:List-Post:
	 List-Unsubscribe:From;
	b=MiGF8XvvcBbYyzzcPA3DVCyNZe/bmtLplCXKAEaNzYZlPFpe/CT3C0xruODHJWsR2
	 YTZ0cJfLo9znCIT3NgFWKH1lr4zmXv7VjKIX5EsQYBuXSkgeExv5/667W2qDVTC83n
	 dxmH4km68e8RHbdYH0tD8tVB0mkLlUbeSCrLjVTQ=
From: "Joshua Z. Zhang" <notifications@github.com>
Reply-To: apache/incubator-mxnet <reply+AAUBKQ27PA2MXESLNLNRDIN4EY4CLEVBNHHCBLXNWQ@reply.github.com>
To: apache/incubator-mxnet <incubator-mxnet@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Message-ID: <apache/incubator-mxnet/issues/17269@github.com>
Subject: [apache/incubator-mxnet] [mxnet 2.0][item 4.8][RFC] Gluon Data API
 Extension and Fixes(Part 2) (#17269)
Mime-Version: 1.0
Content-Type: multipart/alternative;
 boundary="--==_mimepart_5e1905a573743_6aa3fd1d18cd95c942a5";
 charset=UTF-8
Content-Transfer-Encoding: 7bit
X-GitHub-Sender: zhreshold
X-GitHub-Recipient: szha
X-GitHub-Reason: subscribed
List-Archive: https://github.com/apache/incubator-mxnet
X-Auto-Response-Suppress: All
X-GitHub-Recipient-Address: dmlc.notification@gmail.com

----==_mimepart_5e1905a573743_6aa3fd1d18cd95c942a5
Content-Type: text/plain;
 charset=UTF-8
Content-Transfer-Encoding: 7bit

## Description
This is the part 2 of Gluon Data API extension and fixes, which mainly focus on speed up the current data loading pipeline using gluon dataset and dataloader.

## Motivation

The current data loading pipeline is the major bottleneck for many training tasks. We can summarize the entire flow as:

```bash
| Dataset.__getitem__ -> 
| Transform.__call__()/forward() ->
| Batchify ->
| (optional communicate through shared_mem) ->
| split_and_load(ctxs) ->
| <training on GPUs>
-> 
```
where there are performance concerns:
- performance of python dataset/transform functions aren't satisfying
- it's not easy to embrace multithreading to speed up dataloading due to global interpreter lock
- python multiprocessing is unfortunately slow and error prune, not to mention the shared memory implementations on different OS are quite difference and very annoying(e.g., it's very likely to run out of shared memory if not properly taken care of)
- currently memory planing for batchify is non-exist, causing frequent alloc/dealloc for large chunk of memory if the batch size is big
- batchify then split and load can be optimized to partial_batchify

## Proposal
To alleviate the existing troubles I propose to use a hybrid solution, that is to 
- provide C++ Datasets that can cover the most usecases
    ```python
    from gluon.data.dataset import TupleDataset, ImageFolderDataset, ArrayDataset
    # as long as TupleDataset, ImageSequenceDataset, ArrayDataset are supported by backend
    dataset = TupleDataset([ImageSequenceDataset(img_paths), ArrayDataset(image_labels)])
    # dataset is an image classification dataset while fully supported in C++
    # with TupleDataset we can combine as many data as possible

    # a C++ backed Dataset can have a magic __handle__ method to return the c++ handle for reference
    class TupleDataset:
        def __init__(self, datasets):
            if all([callable(getattr(dataset, '__handle__')) for dataset in datasets]):
                # all supported by backend
                self._tuple_dataset = check_call(_LIB.MXTupleDatasetCreate([getattr(dataset, '__handle__') for dataset in datasets]))
            else:
                self._tuple_dataset = None

            def __handle__(self):
                return self._tuple_dataset
                    
    ```
- provide common C++ batchify functions that are split and context aware. Batchify with memory planner is TBD.
- provide a C++ `MultithreadingDataLoader` which inherit the same arguments as `gluon.data.DataLoader` but use mxnet internal multithreading rather than python multiprocessing.
- fallback to python multiprocessing whenever 
    - the dataset is not fully supported by backend(e.g., there are custom python datasets)
    - Transform is not fully hybridizable
    - Batchify is not fully supported by backend

User will continue to use the existing `gluon.data.DataLoader`, and the conversion will be applied automatically
```python

loader = gluon.data.DataLoader(hybrid_dataset.transform(hybrid_transform), batch_size=32, batchify_fn=hybrid_batchify)

def DataLoader:
    def __init__(self, dataset, ...):
        if isinstance(dataset, _LazyTransformDataset) and is_hybrid(dataset._transform) and is_hybrid(dataset) and is_hybrid(batchify_fn):
            self._mt_dataloader = check_call(_LIB.MXMultiThreadDataLoaderCreate(...))
    def __iter__(self):
        if self._mt_dataloader:
                return self._mt_dataloader
        else:
               # fallback to single thread normal dataloader or multiprocessing dataloader

```

With this change, mxnet 2.0 will get smooth transition to mixed data loaders. Please comment with specific examples where this proposal fail to accommodate.

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/incubator-mxnet/issues/17269
----==_mimepart_5e1905a573743_6aa3fd1d18cd95c942a5--