systemml-issues mailing list archives

From "LI Guobao (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
Date Wed, 09 May 2018 10:20:00 GMT

     [ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao updated SYSTEMML-2085:
--------------------------------
    Description: 
A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits.
 # In the case of a local multi-threaded parameter server, it is easy to maintain a concurrent hashmap inside the CP, where each parameter is stored as a value under a defined key. The workers are launched as threads that execute the gradient-computation function and push the resulting gradients to the hashmap; a separate thread pulls the gradients from the hashmap and calls the aggregation function to update the parameters (see the first sketch after this list).
 # In the case of the Spark distributed backend, we could launch a single remote parameter server outside of the CP (as a worker) to provide the push and pull services. For the moment, all weights and biases are kept in this single server, and the exchange between the server and the workers is implemented over TCP. Hence, we can simply broadcast the server's IP address and port number to the workers, which then send gradients and retrieve the updated parameters via TCP sockets (see the second sketch after this list).
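Below is a minimal, hypothetical sketch of the local variant from item 1. Parameters and gradients are simplified to double arrays, and the gradient and aggregation functions are placeholders standing in for SystemML's actual matrix objects and update functions:

{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch of the local multi-threaded variant: workers push gradients,
// a single aggregator thread pulls them and updates the shared parameters.
// Keys, the gradient function, and the update rule are all placeholders.
public class LocalParamServerSketch {
  static final ConcurrentHashMap<String, double[]> params = new ConcurrentHashMap<>();
  static final BlockingQueue<double[]> gradients = new LinkedBlockingQueue<>();
  static final int NUM_WORKERS = 4;

  public static void main(String[] args) throws InterruptedException {
    params.put("global-param", new double[]{0.0});

    // workers: pull the current parameters, compute a (dummy) gradient, push it
    ExecutorService pool = Executors.newFixedThreadPool(NUM_WORKERS);
    for (int i = 0; i < NUM_WORKERS; i++) {
      pool.submit(() -> {
        double[] replica = params.get("global-param");
        gradients.add(new double[]{computeGradient(replica)});
      });
    }

    // aggregator thread: pull each gradient and apply the aggregation function
    Thread aggregator = new Thread(() -> {
      for (int i = 0; i < NUM_WORKERS; i++) {
        try {
          double[] g = gradients.take();
          params.compute("global-param", (k, p) -> aggregate(p, g));
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
      }
    });
    aggregator.start();

    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.MINUTES);
    aggregator.join();
    System.out.println("updated param: " + params.get("global-param")[0]);
  }

  static double computeGradient(double[] p) { return 0.1; }   // placeholder
  static double[] aggregate(double[] p, double[] g) {         // placeholder SGD step
    return new double[]{p[0] - 0.01 * g[0]};
  }
}
{code}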
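And a bare-bones illustration of the push/pull service from item 2. The line-based PUSH/PULL text protocol and the single scalar parameter are assumptions for illustration only, not the actual wire format or data model:

{code:java}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a remote parameter server reachable over TCP: workers connect to
// the broadcast host/port, send "PUSH <gradient>" to deliver a gradient and
// "PULL" to fetch the current value. The scalar model is illustrative.
public class TcpParamServerSketch {
  static final ConcurrentHashMap<String, Double> params = new ConcurrentHashMap<>();

  public static void main(String[] args) throws IOException {
    params.put("global-param", 0.0);
    try (ServerSocket server = new ServerSocket(9000)) { // port broadcast to workers
      while (true) {
        Socket worker = server.accept();
        new Thread(() -> handle(worker)).start();        // one handler per worker
      }
    }
  }

  static void handle(Socket socket) {
    try (BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()));
         PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
      String line;
      while ((line = in.readLine()) != null) {
        if (line.startsWith("PUSH ")) {
          double grad = Double.parseDouble(line.substring(5));
          params.merge("global-param", grad, (p, g) -> p - 0.01 * g); // SGD-style update
          out.println("ACK");
        } else if (line.equals("PULL")) {
          out.println(params.get("global-param"));                    // return parameters
        }
      }
    } catch (IOException e) {
      // worker disconnected; nothing to clean up in this sketch
    }
  }
}
{code}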

We may also need to implement synchronization between the workers and the parameter server in order to support more parameter-update strategies. For example, the stale-synchronous strategy needs a hyperparameter "staleness" to define the waiting interval. The idea is to maintain in the server a vector clock consisting of all workers' clocks. Each time a worker finishes an iteration, it sends a request to the server, and the server sends back a response indicating whether the worker should wait or not (see the sketch below).
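As a concrete illustration of this check, a small server-side sketch assuming the vector clock is kept as an array of per-worker iteration counters (all names are illustrative):

{code:java}
import java.util.concurrent.atomic.AtomicIntegerArray;

// Server-side stale-synchronous check: a worker that finished an iteration may
// proceed only if it is at most `staleness` iterations ahead of the slowest
// worker; otherwise the server tells it to wait. All names are illustrative.
public class StalenessCheckSketch {
  private final AtomicIntegerArray clocks; // vector clock: one entry per worker
  private final int staleness;             // hyperparameter "staleness"

  public StalenessCheckSketch(int numWorkers, int staleness) {
    this.clocks = new AtomicIntegerArray(numWorkers);
    this.staleness = staleness;
  }

  // Called when the given worker reports a finished iteration; returns true if
  // it may start the next iteration, false if it should wait for stragglers.
  // (The min scan is not atomic with the increment, which is fine for a sketch.)
  public boolean onIterationFinished(int workerId) {
    int myClock = clocks.incrementAndGet(workerId);
    int minClock = Integer.MAX_VALUE;
    for (int i = 0; i < clocks.length(); i++)
      minClock = Math.min(minClock, clocks.get(i));
    return myClock - minClock <= staleness;
  }
}
{code}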

  was: A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits. The idea is to run a single-node parameter server by maintaining a hashmap inside the CP (Control Program), where each parameter is stored as a value under a defined key. For example, inserting the global parameter under the key "worker-param-replica" allows the workers to retrieve the parameter replica. Hence, in the context of the local multi-threaded backend, workers can communicate directly with this hashmap in the same process. In the context of the Spark distributed backend, the CP first needs to fork a thread to start a parameter server that maintains a hashmap; the workers can then send intermediates and retrieve parameters by connecting to the parameter server via TCP sockets. Since SystemML has good cache management, we only need to keep in the hashmap a matrix reference pointing to a file location instead of the real data instance. If time permits, in order to introduce the asynchronous and stale update strategies, we would need to implement the synchronization by leveraging a vector clock.


> Single-node parameter server primitives
> ---------------------------------------
>
>                 Key: SYSTEMML-2085
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2085
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: Matthias Boehm
>            Assignee: LI Guobao
>            Priority: Major
>         Attachments: ps.png
>
>
> A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits.
>  # In the case of a local multi-threaded parameter server, it is easy to maintain a concurrent hashmap inside the CP, where each parameter is stored as a value under a defined key. The workers are launched as threads that execute the gradient-computation function and push the resulting gradients to the hashmap; a separate thread pulls the gradients from the hashmap and calls the aggregation function to update the parameters.
>  # In the case of the Spark distributed backend, we could launch a single remote parameter server outside of the CP (as a worker) to provide the push and pull services. For the moment, all weights and biases are kept in this single server, and the exchange between the server and the workers is implemented over TCP. Hence, we can simply broadcast the server's IP address and port number to the workers, which then send gradients and retrieve the updated parameters via TCP sockets.
> We may also need to implement synchronization between the workers and the parameter server in order to support more parameter-update strategies. For example, the stale-synchronous strategy needs a hyperparameter "staleness" to define the waiting interval. The idea is to maintain in the server a vector clock consisting of all workers' clocks. Each time a worker finishes an iteration, it sends a request to the server, and the server sends back a response indicating whether the worker should wait or not.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
