reef-dev mailing list archives

From "Markus Weimer (JIRA)" <>
Subject [jira] [Commented] (REEF-1477) Provide a data-centric API for stitching REEF jobs together
Date Thu, 30 Jun 2016 16:49:10 GMT


Markus Weimer commented on REEF-1477:

+1 overall :)

For the {{.RunIMRU}} method, it would be good to also think about an extension mechanism.
I'd rather not end up in a situation where every ML abstraction du jour requires deep changes
to this (likely) central future REEF API. To enable that, we pursued two lines of thought
in the past (before the days of Apache REEF; no code useful enough to contribute was produced):

  * *DAGR* (DAGs on REEF): The idea here is to model the computation as a DAG of Drivers that
take over the Evaluators (and data on them). For instance, via a method like {{trainData.RunDriver(IConfiguration
driverConfiguration)}}. Of course, this wouldn't literally be a current REEF Driver, as it
is handed data and Evaluators as input. Instead, we'd define a "Mini Driver" API for that.
  * *CDR*, which stands for Cloud Data Runtime. It solves the same challenge differently:
Here, each node of the DAG is modeled as a control flow master that is *not* as powerful as
a Driver. Instead, these "Stage Controllers" emit Tasks to be scheduled on the Evaluators.
This allows a framework to go in between the Stage Controller and the Evaluators to enable
more global scheduling than in the DAGR model. This is similar to what Spark has with the
concept of a {{StageScheduler}}/{{TaskScheduler}}. Crucially, we'd need gang scheduling in
the latter to enable HPC setups like in IMRU.
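To make the contrast concrete, here is a minimal Java sketch of the two models. None of these names exist in REEF today - {{MiniDriver}}, {{StageController}}, and the trivial sequential "scheduler" below are all placeholders for illustration, not proposed signatures:

```java
import java.util.ArrayList;
import java.util.List;

public class SchedulingModels {

    /**
     * DAGR model (hypothetical): a "Mini Driver" is handed its input (data and,
     * in a real system, Evaluators) and drives its DAG node itself, returning
     * the output that the next node consumes.
     */
    interface MiniDriver<I, O> {
        O run(I input);
    }

    /**
     * CDR model (hypothetical): a "Stage Controller" does not execute anything
     * itself; it only emits tasks, so an external scheduler sitting between the
     * controller and the Evaluators can place them globally (e.g. gang-schedule).
     */
    interface StageController<I, O> {
        List<Runnable> emitTasks(I input, List<O> resultSink);
    }

    public static void main(String[] args) {
        // DAGR: the Mini Driver computes its stage directly.
        MiniDriver<int[], Integer> sumDriver = data -> {
            int s = 0;
            for (int x : data) s += x;
            return s;
        };
        System.out.println("DAGR sum: " + sumDriver.run(new int[]{1, 2, 3}));

        // CDR: the controller hands tasks to a scheduler. Here the "scheduler"
        // just runs them sequentially; a real one could place them across
        // Evaluators and gang-schedule the whole stage.
        StageController<int[], Integer> sumController = (data, sink) -> {
            List<Runnable> tasks = new ArrayList<>();
            for (int x : data) tasks.add(() -> sink.add(x));
            return tasks;
        };
        List<Integer> partials = new ArrayList<>();
        for (Runnable task : sumController.emitTasks(new int[]{1, 2, 3}, partials)) {
            task.run();
        }
        System.out.println("CDR sum: " + partials.stream().mapToInt(Integer::intValue).sum());
    }
}
```

The point of the sketch is only where control lives: the Mini Driver owns execution of its node, while the Stage Controller yields it to whatever sits between it and the Evaluators.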

Of the two, DAGR is easier to implement, especially if we restrict ourselves to one "Mini
Driver" to be active at any given time. It is also more flexible, as any scheduling primitive
can be implemented by a Mini Driver without change to the core APIs discussed here. However,
CDR opens a wider field of potential future optimizations in that core.

In the spirit of moving fast to a prototype we can analyze, I'm somewhat more in favor of
the DAGR approach right now. If it comes to it, we can make CDR a Mini Driver :)

> Provide a data-centric API for stitching REEF jobs together
> -----------------------------------------------------------
>                 Key: REEF-1477
>                 URL:
>             Project: REEF
>          Issue Type: New Feature
>          Components: REEF.NET
>            Reporter: Joo Seong (Jason) Jeong
> The typical flow of using REEF to run machine learning and data analytics involves submitting
several REEF jobs one at a time, each producing some trained model, intermediate data, or
other analysis results. Connecting the jobs together, e.g. using a previously trained model
to perform predictions on a test dataset, must be separately managed by the user. For a long
series of REEF jobs, this is certainly not desirable - we would like to be able to stitch
a sequence of REEF jobs in a declarative fashion. Moreover, as REEF's name suggests, we should
reuse resources for consecutive jobs when possible.
> This can be achieved by providing a data-centric API for running REEF that focuses on
the objects instead of REEF program details:
> {code}
> // example
> var trainData = load("hdfs://.../");
> var model = trainData.RunIMRU(jobSpec);
> var testData = load("hdfs://.../");
> var transformedData = testData.ApplyTransform(transform);
> var results = transformedData.RunIMRU(jobSpecAndModel);
> results.Store("hdfs://.../");
> {code}
> Each method call on datasets will start a new REEF job on Evaluators - not necessarily
a new Driver - and return an object that can be reused later. Users only need to provide the
job spec of each stage and not how the stages get linked with each other. Through this API,
constructing a pipeline of data analytics on REEF will get easier and more intuitive.
> This JIRA will serve as an umbrella for the related issues to provide such an API.

This message was sent by Atlassian JIRA
