systemml-issues mailing list archives

From "LI Guobao (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
Date Sat, 05 May 2018 21:04:00 GMT

     [ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao updated SYSTEMML-2085:
--------------------------------
    Description: A single-node parameter server acts as a data-parallel parameter server;
a multi-node, model-parallel parameter server will be discussed if time permits. The idea
is to run the single-node parameter server by maintaining a hashmap inside the CP (Control
Program), in which each parameter is stored as a value under a defined key. For example,
inserting the global parameter under the key “worker-param-replica” allows the workers to
retrieve the parameter replica. In the local multi-threaded backend, workers can therefore
access this hashmap directly within the same process. In the Spark distributed backend, the
CP first needs to fork a thread to start a parameter server that maintains the hashmap; the
workers can then send intermediates and retrieve parameters by connecting to the parameter
server via a TCP socket. Since SystemML has good cache management, the hashmap only needs to
hold matrix references pointing to file locations rather than the actual data instances. If
time permits, introducing the asynchronous and staleness-based update strategies would
require implementing the synchronization with vector clocks.  (was: A single
node parameter server acts as a data-parallel parameter server. And a multi-node model parallel
parameter server will be discussed if time permits. )
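
The local multi-threaded case described above can be sketched roughly as follows. This is
only an illustration, not SystemML code: the class LocalParamServer, its put/get/update
methods, the worker function, and the fixed gradient values are all hypothetical names and
data chosen for the example. For simplicity the hashmap holds the parameter values directly,
whereas the description proposes storing matrix references that point to file locations.

```python
import threading

KEY = "worker-param-replica"  # key name taken from the issue description


class LocalParamServer:
    """Hypothetical in-process parameter server backed by a dict (hashmap)."""

    def __init__(self):
        self._store = {}
        self._lock = threading.Lock()  # guards concurrent worker access

    def put(self, key, value):
        with self._lock:
            self._store[key] = list(value)

    def get(self, key):
        # Return a copy so each worker holds its own replica.
        with self._lock:
            return list(self._store[key])

    def update(self, key, gradient, lr):
        # Apply one gradient step to the global parameter atomically.
        with self._lock:
            current = self._store[key]
            self._store[key] = [w - lr * g for w, g in zip(current, gradient)]


def worker(ps, gradient, lr):
    # A real worker would compute the gradient from its replica and its
    # data partition; here the gradient is fixed (assumption for the demo).
    replica = ps.get(KEY)
    assert len(replica) == len(gradient)
    ps.update(KEY, gradient, lr)


def run_demo():
    ps = LocalParamServer()
    ps.put(KEY, [1.0, 1.0])
    threads = [
        threading.Thread(target=worker, args=(ps, [0.5, 0.25], 1.0))
        for _ in range(2)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ps.get(KEY)


if __name__ == "__main__":
    print(run_demo())
```

Because the workers run in the same process as the hashmap, no serialization or socket is
needed here; in the Spark backend the same put/get/update operations would instead be
exposed over the TCP connection mentioned in the description.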

> Single-node parameter server primitives
> ---------------------------------------
>
>                 Key: SYSTEMML-2085
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2085
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: Matthias Boehm
>            Assignee: LI Guobao
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
