Date: Thu, 30 Jun 2016 23:14:10 +0000 (UTC)
From: "Byung-Gon Chun (JIRA)"
To: dev@reef.apache.org
Reply-To: dev@reef.apache.org
Subject: [jira] [Comment Edited] (REEF-1477) Provide a data-centric API for stitching REEF jobs together

    [ https://issues.apache.org/jira/browse/REEF-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15358022#comment-15358022 ]

Byung-Gon Chun edited comment on REEF-1477 at 6/30/16 11:13 PM:
----------------------------------------------------------------

Isn't this adding a workflow API?
If the job dependency is straightforward, pipeline API? E.g., Spark pipeline API.
http://spark.apache.org/docs/latest/ml-guide.html#main-concepts-in-pipelines

was (Author: bgchun):
Isn't this adding a workflow API?

> Provide a data-centric API for stitching REEF jobs together
> -----------------------------------------------------------
>
>                 Key: REEF-1477
>                 URL: https://issues.apache.org/jira/browse/REEF-1477
>             Project: REEF
>          Issue Type: New Feature
>          Components: REEF.NET
>            Reporter: Joo Seong (Jason) Jeong
>
> The typical flow of using REEF to run machine learning data analytics involves submitting several REEF jobs one at a time, each producing some trained model, intermediate data, or other analysis results. Connecting the jobs together, e.g. using a previously trained model to perform predictions on a test dataset, must be separately managed by the user. For a long series of REEF jobs, this is certainly not desirable - we would like to be able to stitch a sequence of REEF jobs in a declarative fashion. Moreover, as REEF's name suggests, we should reuse resources for consecutive jobs when possible.
> This can be achieved by providing a data-centric API for running REEF that focuses on the objects instead of REEF program details:
> {code}
> // example
> var trainData = load("hdfs://.../");
> var model = trainData.RunIMRU(jobSpec);
> var testData = load("hdfs://.../");
> var transformedData = testData.ApplyTransform(transform);
> var results = transformedData.RunIMRU(jobSpecAndModel);
> results.Store("hdfs://.../");
> {code}
> Each method call on datasets will start a new REEF job on Evaluators - not necessarily a new Driver - and return an object that can be reused later. Users only need to provide the job spec of each stage and not how the stages get linked with each other. Through this API, constructing a pipeline of data analytics on REEF will get easier and more intuitive.
> This JIRA will serve as an umbrella for the related issues to provide such an API.
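
The shape of the proposed API (each dataset method starts a job and returns a reusable handle, while one long-lived session stands in for the shared Driver) can be sketched roughly as follows. This is a language-neutral illustration in Python, not REEF.NET code: `Session`, `Dataset`, `run_job`, and `apply_transform` are hypothetical names, the in-memory lists stand in for the HDFS inputs, and each "job" is just a local function call rather than work scheduled on Evaluators.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Session:
    """Hypothetical stand-in for a long-lived Driver that reuses resources."""
    jobs_run: List[str] = field(default_factory=list)

    def run(self, name: str, job: Callable[[Any], Any], data: Any) -> Any:
        # Assumption: a real implementation would launch a REEF job on
        # already-allocated Evaluators; here the job runs locally.
        self.jobs_run.append(name)
        return job(data)


@dataclass
class Dataset:
    """Handle to the output of a stage; every method returns a new handle."""
    session: Session
    data: Any

    def run_job(self, name: str, job_spec: Callable[[Any], Any]) -> "Dataset":
        # The caller supplies only the stage's job spec, not the linkage.
        return Dataset(self.session, self.session.run(name, job_spec, self.data))

    def apply_transform(self, transform: Callable[[Any], Any]) -> "Dataset":
        # Element-wise transform, executed as its own job in the same session.
        return self.run_job("transform", lambda rows: [transform(r) for r in rows])


# Usage mirroring the example in the issue, with toy in-memory data.
session = Session()
train_data = Dataset(session, [1.0, 2.0, 3.0])
model = train_data.run_job("train", lambda rows: sum(rows) / len(rows))

test_data = Dataset(session, [0.5, 1.5])
transformed = test_data.apply_transform(lambda x: x * 2)
threshold = model.data  # the trained "model" is reused by a later stage
results = transformed.run_job("predict", lambda rows: [x > threshold for x in rows])
```

All three stages go through the same `Session`, which is where a real implementation would decide whether to reuse Evaluators between consecutive jobs.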