systemml-issues mailing list archives

From "LI Guobao (JIRA)" <>
Subject [jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
Date Wed, 09 May 2018 20:23:00 GMT


LI Guobao updated SYSTEMML-2085:
    Attachment:     (was: ps.png)

> Single-node parameter server primitives
> ---------------------------------------
>                 Key: SYSTEMML-2085
>                 URL:
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: Matthias Boehm
>            Assignee: LI Guobao
>            Priority: Major
>         Attachments: ps.png
> A single-node parameter server acts as a data-parallel parameter server. A multi-node,
model-parallel parameter server will be discussed if time permits.
> Push/Pull service:
> In general, we could launch a parameter server inside CP (local multi-threaded backend) or
outside CP (Spark distributed backend) to provide the pull and push service. For the moment,
all the weights and biases are stored in a hashmap under a key, e.g., "global parameter". Each
worker's gradients will be put into the hashmap separately under a given key. The exchange
between server and workers will be implemented over TCP, so we can simply broadcast the server's
IP address and port number to the workers. The workers can then send their gradients
and retrieve the new parameters via a TCP socket. The server will also spawn a thread that
retrieves the gradients by polling the hashmap with the relevant keys and aggregates them.
Finally, it updates the global parameter in the hashmap.
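> The push/pull flow described above could be sketched roughly as follows. This is a minimal
in-process sketch, not the SystemML implementation: the TCP layer is left out, and the class
name ParamServer, the "gradients_<id>" key scheme, the method names and the averaged-gradient
SGD update are all assumptions made for illustration.

```python
import threading


class ParamServer:
    """Hypothetical in-process stand-in for the push/pull service."""

    GLOBAL_KEY = "global parameter"  # key named in the description

    def __init__(self, init_params, num_workers, lr=0.1):
        # One hashmap holds the global parameters and the pushed gradients.
        self.store = {self.GLOBAL_KEY: list(init_params)}
        self.lock = threading.Lock()
        self.num_workers = num_workers
        self.lr = lr  # assumed learning rate for the assumed SGD update

    def push(self, worker_id, grads):
        # Each worker's gradients go under their own key (assumed key scheme).
        with self.lock:
            self.store["gradients_%d" % worker_id] = list(grads)

    def pull(self):
        # Return a copy so workers cannot mutate the global parameters.
        with self.lock:
            return list(self.store[self.GLOBAL_KEY])

    def aggregate_once(self):
        """One polling pass: collect pending gradients and update the parameters.

        Returns the number of workers whose gradients were consumed.
        """
        with self.lock:
            keys = ["gradients_%d" % w for w in range(self.num_workers)]
            pending = [self.store.pop(k) for k in keys if k in self.store]
            if not pending:
                return 0
            params = self.store[self.GLOBAL_KEY]
            for i in range(len(params)):
                mean_grad = sum(g[i] for g in pending) / len(pending)
                params[i] -= self.lr * mean_grad  # assumed update rule
            return len(pending)


if __name__ == "__main__":
    server = ParamServer([1.0, 2.0], num_workers=2, lr=0.5)
    server.push(0, [2.0, 0.0])
    server.push(1, [0.0, 4.0])
    server.aggregate_once()
    print(server.pull())
```

In a real deployment the server thread would call something like aggregate_once in a loop, and
push and pull would be reached through the TCP socket rather than by direct method calls.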
> Synchronization:
> We also need to implement synchronization between the workers and the parameter server in
order to support more parameter update strategies; e.g., the stale-synchronous strategy needs
a hyperparameter "staleness" to define the waiting interval. The idea is to maintain a vector
clock on the server that records every worker's clock. Each time a worker finishes an
iteration, it waits for a signal from the server, i.e., it sends a request asking the server
to calculate the staleness according to the vector clock. When the server receives the
gradients from a worker, it increments the vector clock entry for that worker. We could then
define BSP as "staleness==0", ASP as "staleness==-1" and SSP as "staleness==N".
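> The vector-clock check could look roughly like the sketch below. The description does not
say how staleness is computed from the vector clock, so the rule used here (a worker may
proceed while its lead over the slowest worker is at most "staleness") is an assumption, as
are the names VectorClock, on_gradients_received and may_proceed.

```python
class VectorClock:
    """Hypothetical server-side clock tracking one counter per worker."""

    def __init__(self, num_workers, staleness):
        self.clocks = [0] * num_workers
        # staleness == 0 -> BSP, staleness == -1 -> ASP, staleness == N -> SSP
        self.staleness = staleness

    def on_gradients_received(self, worker_id):
        # The server increments the entry of the worker that pushed gradients.
        self.clocks[worker_id] += 1

    def may_proceed(self, worker_id):
        # ASP: workers never wait for each other.
        if self.staleness == -1:
            return True
        # Assumed rule: compare this worker's lead over the slowest worker.
        lead = self.clocks[worker_id] - min(self.clocks)
        return lead <= self.staleness


if __name__ == "__main__":
    bsp = VectorClock(num_workers=2, staleness=0)
    bsp.on_gradients_received(0)
    print(bsp.may_proceed(0))  # worker 0 is ahead of worker 1
    bsp.on_gradients_received(1)
    print(bsp.may_proceed(0))  # both workers have finished the iteration
```

Under this assumed rule, BSP makes every worker wait until all others have pushed for the
same iteration, while SSP with staleness N lets the fastest worker run up to N iterations
ahead of the slowest one.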
> A diagram of the parameter server architecture is shown below.

This message was sent by Atlassian JIRA
