systemml-issues mailing list archives

From "LI Guobao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SYSTEMML-2418) Spark data partitioner
Date Wed, 27 Jun 2018 17:05:00 GMT

    [ https://issues.apache.org/jira/browse/SYSTEMML-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16525299#comment-16525299
] 

LI Guobao commented on SYSTEMML-2418:
-------------------------------------

[~mboehm7], is my logic correct? Also, I'd like to know whether the scratch space is shared
by all the remote workers. If so, could the workers load the file from this HDFS directory?

> Spark data partitioner
> ----------------------
>
>                 Key: SYSTEMML-2418
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2418
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: LI Guobao
>            Assignee: LI Guobao
>            Priority: Major
>
> In the context of ML, the training data usually cannot fit in the Spark driver node, so
> partitioning such enormous data is no longer feasible in CP (the control program). This task
> aims to do the data partitioning in a distributed way: each worker receives its split of the
> training data and partitions it locally according to the chosen scheme. All the data is then
> grouped by the given key (i.e., the worker ID) and finally written into separate HDFS files
> in the scratch space.
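The flow described above (local partitioning on each worker, followed by a group-by on the target worker ID, as Spark's groupByKey would do) can be sketched in plain Python. This is a hypothetical illustration, not SystemML code: the hash-style assignment scheme, the function names, and the in-memory "files" dict are all assumptions standing in for the real partition schemes and the per-worker HDFS output files.

```python
# Hypothetical sketch of the described partitioning flow (not SystemML code).
# Each worker partitions its local split of rows, then all (worker_id, row)
# pairs are grouped by key -- analogous to Spark's groupByKey followed by
# writing one HDFS file per worker in the scratch space.
from collections import defaultdict

def partition_local(split, num_workers):
    """Assign each row of a local split to a target worker.

    Uses a simple round-robin scheme as a placeholder; the real task
    supports different partitioning schemes.
    """
    return [(i % num_workers, row) for i, row in enumerate(split)]

def group_by_worker(all_pairs):
    """Group (worker_id, row) pairs by key; each group models one
    per-worker output file in the scratch space."""
    grouped = defaultdict(list)
    for worker_id, row in all_pairs:
        grouped[worker_id].append(row)
    return dict(grouped)

# Two workers each partition their own split; results are grouped by key.
splits = [[10, 11, 12], [20, 21]]
pairs = [p for split in splits for p in partition_local(split, 2)]
files = group_by_worker(pairs)
# files maps worker id -> rows destined for that worker's output file
```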



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
