Date: Wed, 27 Jun 2018 19:53:00 +0000 (UTC)
From: "Matthias Boehm (JIRA)"
To: issues@systemml.apache.org
Subject: [jira] [Commented] (SYSTEMML-2418) Spark data partitioner

    [ https://issues.apache.org/jira/browse/SYSTEMML-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16525520#comment-16525520 ]

Matthias Boehm commented on SYSTEMML-2418:
------------------------------------------

You might want to rephrase the introductory sentence a bit. Fundamentally, we want both local and distributed data partitioners to support the general case of arbitrary data sizes (don't use "overfitting" because it can easily be confused): if the data fits into the driver, we can do local partitioning; otherwise we use distributed data partitioners. For this task, I would recommend focusing on distributed partitioning into k partitions that are the immediate input to the individual workers, without the need for materialization on HDFS. In pseudo code, it would look like {{data.flatmap(d -> partition(d)).reduceByKey(k).forEach(d -> runWorker(d))}}.
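For illustration, here is a minimal Java/Spark sketch of that flow. It is only a sketch under stated assumptions, not the actual SystemML implementation: the {{runWorker}} entry point and the round-robin row assignment are hypothetical, and the {{reduceByKey(k)}} step of the pseudo code is rendered as a {{groupByKey}} on the worker id.

{code:java}
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

public class DistributedPartitionSketch {

  public static void partitionAndRun(JavaRDD<double[]> data, int k) {
    // 1) Assign every row to one of k workers (round-robin on the global row index).
    JavaPairRDD<Integer, double[]> keyed = data
        .zipWithIndex()                                    // (row, globalIndex)
        .mapToPair(t -> new Tuple2<>((int) (t._2() % k), t._1()));

    // 2) Group rows by worker id so each worker receives exactly its partition,
    //    without first materializing the partitions on HDFS.
    JavaPairRDD<Integer, Iterable<double[]>> grouped = keyed.groupByKey(k);

    // 3) Run one worker per partition directly on the grouped data.
    grouped.foreach(p -> runWorker(p._1(), p._2()));
  }

  // Hypothetical worker entry point; in SystemML this would hand the local
  // share of the training data to the parameter-server worker.
  private static void runWorker(int workerId, Iterable<double[]> rows) {
    long n = 0;
    for (double[] row : rows) { n++; }
    System.out.println("worker " + workerId + " received " + n + " rows");
  }
}
{code}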
Later we can optionally also allow the materialization on HDFS. The scratch space is shared by all workers, but for worker-local intermediates and results, we create dedicated subdirectories.

> Spark data partitioner
> ----------------------
>
>                 Key: SYSTEMML-2418
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2418
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: LI Guobao
>            Assignee: LI Guobao
>            Priority: Major
>
> In the context of ML, the training data will usually be "overfitted" (i.e., too large to fit) in the Spark driver node, so partitioning such enormous data is no longer feasible in CP. This task aims to do the data partitioning in a distributed way, which means that each worker receives its split of the training data and partitions it locally according to different schemes. All the data is then grouped by the given key (i.e., the worker id) and finally written into separate HDFS files in the scratch space.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)