reef-dev mailing list archives

From "Joo Seong (Jason) Jeong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (REEF-1477) Provide a data-centric API for stitching REEF jobs together
Date Fri, 01 Jul 2016 04:15:11 GMT

    [ https://issues.apache.org/jira/browse/REEF-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15358352#comment-15358352
] 

Joo Seong (Jason) Jeong commented on REEF-1477:
-----------------------------------------------

Yes, [~bgchun], I guess we can say we're adding a pipeline interface for expressing sequences
of ML stages. Moreover, this API is highly extensible in the sense that new types of 'stages'
can be added - ML abstractions like IMRU - and linked with existing code without modifying
the core API.

> Provide a data-centric API for stitching REEF jobs together
> -----------------------------------------------------------
>
>                 Key: REEF-1477
>                 URL: https://issues.apache.org/jira/browse/REEF-1477
>             Project: REEF
>          Issue Type: New Feature
>          Components: REEF.NET
>            Reporter: Joo Seong (Jason) Jeong
>
> The typical flow of using REEF to run machine learning data analytics involves submitting
> several REEF jobs one at a time, each producing a trained model, intermediate data, or
> other analysis results. Connecting the jobs together, e.g. using a previously trained model
> to perform predictions on a test dataset, must be managed separately by the user. For a long
> series of REEF jobs this is clearly undesirable - we would like to be able to stitch together
> a sequence of REEF jobs in a declarative fashion. Moreover, as REEF's name suggests, we should
> reuse resources across consecutive jobs when possible.
> This can be achieved by providing a data-centric API for running REEF that focuses on
> the data objects rather than on REEF program details:
> {code}
> // example: declaratively stitch ML stages together over datasets
> var trainData = Load("hdfs://.../");          // load training data
> var model = trainData.RunIMRU(jobSpec);       // train a model via IMRU
> var testData = Load("hdfs://.../");           // load test data
> var transformedData = testData.ApplyTransform(transform);
> var results = transformedData.RunIMRU(jobSpecAndModel);  // predict with the trained model
> results.Store("hdfs://.../");                 // persist the results
> {code}
> Each method call on a dataset will start a new REEF job on Evaluators - not necessarily
> with a new Driver - and return an object that can be reused later. Users only need to provide
> the job spec of each stage, not how the stages get linked with each other. Through this API,
> constructing a pipeline of data analytics on REEF becomes easier and more intuitive.
> This JIRA will serve as an umbrella for the related issues needed to provide such an API.
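The behavior described above - chained dataset handles where each method call records a stage and returns a reusable object - can be sketched in plain Java (REEF's primary language). All names here (Dataset, JobSpec, runIMRU, applyTransform) are illustrative assumptions mirroring the example, not actual REEF interfaces; a real implementation would submit jobs to Evaluators instead of just tracking lineage:

```java
import java.util.ArrayList;
import java.util.List;

public class DatasetSketch {
    /** Placeholder for a per-stage job specification (hypothetical type). */
    public static final class JobSpec {
        final String name;
        public JobSpec(String name) { this.name = name; }
    }

    /** Handle to data (or a model) produced by a stage; method calls chain stages. */
    public static final class Dataset {
        public final String uri;
        public final List<String> lineage;  // stages that produced this handle

        private Dataset(String uri, List<String> lineage) {
            this.uri = uri;
            this.lineage = lineage;
        }

        /** Entry point: wrap a storage location in a dataset handle. */
        public static Dataset load(String uri) {
            List<String> l = new ArrayList<>();
            l.add("load(" + uri + ")");
            return new Dataset(uri, l);
        }

        /** In a real implementation this would launch an IMRU job on (reused) Evaluators. */
        public Dataset runIMRU(JobSpec spec) {
            return chain("runIMRU(" + spec.name + ")");
        }

        public Dataset applyTransform(String transform) {
            return chain("applyTransform(" + transform + ")");
        }

        public void store(String uri) {
            lineage.add("store(" + uri + ")");
        }

        /** Return a new handle whose lineage extends this one by one stage. */
        private Dataset chain(String stage) {
            List<String> l = new ArrayList<>(lineage);
            l.add(stage);
            return new Dataset(uri, l);
        }
    }

    public static void main(String[] args) {
        Dataset train = Dataset.load("hdfs://train");
        Dataset model = train.runIMRU(new JobSpec("train-stage"));
        System.out.println(model.lineage);
        // prints [load(hdfs://train), runIMRU(train-stage)]
    }
}
```

The key design point is that the handle returned by each call is all the user touches; how stages are wired to jobs (and whether Evaluators are reused between them) stays inside the runtime.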



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
