systemml-issues mailing list archives

From "Matthias Boehm (JIRA)" <>
Subject [jira] [Commented] (SYSTEMML-2418) Spark data partitioner
Date Wed, 27 Jun 2018 19:53:00 GMT


Matthias Boehm commented on SYSTEMML-2418:

You might want to rephrase the introductory sentence a bit. Fundamentally, we want both local
and distributed data partitioners to support the general case of arbitrary data sizes (don't use
"overfitting" because it can easily be confused): if the data fits into the driver, we can
do local partitioning; otherwise, we use distributed data partitioners. For this task, I would
recommend focusing on the distributed partitioning into k partitions that are the immediate
input to the individual workers, without the need for materialization on HDFS. In pseudo code,
it would look like {{data.flatMap(d -> partition(d)).groupByKey(k).foreach(d -> runWorker(d))}}.
Later, we can optionally also allow materialization on HDFS. The scratch space is shared
by all workers, but for worker-local intermediates and results, we create dedicated subdirectories.
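
To make that concrete, below is a minimal, self-contained Java/Spark sketch of the proposed
pipeline; the {{Row}} wrapper, {{partitionForWorkers}}, and {{runWorker}} are hypothetical
stand-ins for the actual SystemML data structures and worker logic:

{code:java}
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.io.Serializable;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public class DistributedPartitionerSketch {

  // Hypothetical row wrapper: a row index plus its values.
  static class Row implements Serializable {
    final long index;
    final double[] values;
    Row(long index, double[] values) { this.index = index; this.values = values; }
  }

  // Assign a row to one of k workers; round-robin here, but other
  // schemes (e.g., disjoint-contiguous) plug in the same way.
  static Iterator<Tuple2<Integer, Row>> partitionForWorkers(Row r, int k) {
    return Collections.singletonList(
        new Tuple2<>((int) (r.index % k), r)).iterator();
  }

  // Hypothetical worker body: consumes its partition directly,
  // without any materialization on HDFS in between.
  static void runWorker(Integer workerId, Iterable<Row> localData) {
    long n = 0;
    for (Row ignored : localData) n++;
    System.out.println("worker " + workerId + " received " + n + " rows");
  }

  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("partitioner-sketch").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      int k = 4; // number of workers
      List<Row> rows = new ArrayList<>();
      for (long i = 0; i < 1000; i++)
        rows.add(new Row(i, new double[]{i, 2.0 * i}));
      JavaRDD<Row> data = sc.parallelize(rows);

      // data.flatMap(partition).groupByKey(k).foreach(runWorker)
      JavaPairRDD<Integer, Row> keyed =
          data.flatMapToPair(r -> partitionForWorkers(r, k));
      keyed.groupByKey(k).foreach(t -> runWorker(t._1(), t._2()));
    }
  }
}
{code}

Note that {{groupByKey(k)}} (rather than {{reduceByKey}}, which expects a combine function)
collects each worker's rows with a single shuffle, and each worker then consumes its partition
in place rather than reading it back from HDFS.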

> Spark data partitioner
> ----------------------
>                 Key: SYSTEMML-2418
>                 URL:
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: LI Guobao
>            Assignee: LI Guobao
>            Priority: Major
> In the context of ML, the training data will usually be "overfitted" in the Spark driver node,
> i.e., too large to fit into its memory. Partitioning such enormous data is therefore no longer
> feasible in CP. This task aims to do the data partitioning in a distributed way, which means
> that each worker receives its split of the training data and does the data partitioning locally
> according to different schemes. All the data is then grouped by the given key (i.e., the worker
> id) and finally written into separate HDFS files in the scratch space.
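
As a rough sketch of that optional materialization step (reusing the {{(workerId, row)}} keyed
RDD and the {{Row}} class from the sketch above; the text encoding and the scratch-space path
are hypothetical):

{code:java}
import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;
import java.util.Arrays;

// Optional materialization of the k partitions on HDFS. A HashPartitioner
// sends Integer key i (0 <= i < k) to partition i, so each worker's rows
// land in their own part file, e.g. <scratchDir>/part-00000 for worker 0.
static void materializePartitions(JavaPairRDD<Integer, Row> keyed,
    int k, String scratchDir) {
  keyed.partitionBy(new HashPartitioner(k))
       .map(t -> t._2().index + " " + Arrays.toString(t._2().values)) // hypothetical text encoding
       .saveAsTextFile(scratchDir); // e.g., a dedicated subdirectory of the shared scratch space
}
{code}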

This message was sent by Atlassian JIRA
