systemml-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Matthias Boehm (JIRA)" <>
Subject [jira] [Commented] (SYSTEMML-2083) Language and runtime for parameter servers
Date Tue, 13 Feb 2018 06:24:02 GMT


Matthias Boehm commented on SYSTEMML-2083:

Great - thanks for your interest [~mpgovinda]. Below I try to give you a couple of pointers
and a more concrete idea of the project. Additional details regarding the individual sub tasks
will follow later and likely evolve over time. Note that we're happy to help out and guide
you where needed. 

SystemML is an existing system for a broad range of machine learning algorithms. Users write
their algorithms in an R- or Python-like syntax with abstract data types for scalars, matrix,
and frames as well as operations such as linear algebra, element-wise operations, aggregations,
indexing, and statistical functions. SystemML then automatically compiles these scripts into
hybrid runtime plans of single-node and distributed operations on MapReduce or Spark according
to data and cluster characteristics. For more details, please refer to our website (
as well as our "SystemML on Spark" paper (

In the past, we primarily focused on data- and task-parallel execution strategies (as described
above) but in the last years we also added support for deep learning including an nn script
library, various builtin functions for specific layers, as well as a native and GPU operations.

This epic aims to extend these capabilities by execution strategies for parameter servers.
We want to build alternative runtime backends as a foundation which would already enable users
to easily select their preferred strategy for local or distributed execution. Later (not part
of this project) we would like to futher extend this to the automatic selection of these strategies.

Specifically, this project aims to introduce a new builtin function, called {{paramserv}}
that can be called at script level.
[model’] = paramserv(model, X, y, X_val, y_val, fun1,
   mode=ASYNC, freq=EPOCH, agg=..., epochs=100, batchsize=64, k=7, checkpointing=...)
where we pass an existing (e.g., for transfer learning) or otherwise initialized {{model}},
the training feature and label matrices {{X}}, {{y}}, the validation features and labels {{X_val}},
{{y_val}}, a batch update function specified in SystemML's R- or Python-like language, an
update strategy {{mode}} along with its frequency {{freq}} (e.g., per batch or epoch), an
aggregation function {{agg}}, the number of {{epochs}}, {{batchsize}}, degree of parallelism
{{k}}, and a checkpointing strategy. 

The core of the project then deals with implementing the runtime for this builtin function
in Java for both local, multi-threaded execution and distributed execution on top Spark. The
advantage of building the distributed parameter servers on top of the data-parallel Spark
framework is a seamless integration with the rest of the SystemML (e.g., where the input feature
matrix {{X}} cab be a large RDD). Since the update and aggregation functions are expressed
in SystemML's language, we can simply reuse the existing runtime (control flow, instructions,
and matrix operations) and concentrate on building alternative parameter update mechanisms.

> Language and runtime for parameter servers
> ------------------------------------------
>                 Key: SYSTEMML-2083
>                 URL:
>             Project: SystemML
>          Issue Type: Epic
>            Reporter: Matthias Boehm
>            Priority: Major
>              Labels: gsoc2018
> SystemML already provides a rich set of execution strategies ranging from local operations
to large-scale computation on MapReduce or Spark. In this context, we support both data-parallel
(multi-threaded or distributed operations) as well as task-parallel computation (multi-threaded
or distributed parfor loops). This epic aims to complement the existing execution strategies
by language and runtime primitives for parameter servers, i.e., model-parallel execution.
We use the terminology of model-parallel execution with distributed data and distributed model
to differentiate them from the existing data-parallel operations. Target applications are
distributed deep learning and mini-batch algorithms in general. These new abstractions will
help making SystemML a unified framework for small- and large-scale machine learning that
supports all three major execution strategies in a single framework.
> A major challenge is the integration of stateful parameter servers and their common push/pull
primitives into an otherwise functional (and thus, stateless) language. We will approach this
challenge via a new builtin function \{{paramserv}} which internally maintains state but at
the same time fits into the runtime framework of stateless operations.
> Furthermore, we are interested in providing (1) different runtime backends (local and
distributed), (2) different parameter server modes (synchronous, asynchronous, hogwild!, stale-synchronous),
(3) different update frequencies (batch, multi-batch, epoch), as well as (4) different architectures
for distributed data (1 parameter server, k workers) and distributed model (k1 parameter servers,
k2 workers). 

This message was sent by Atlassian JIRA

View raw message