reef-dev mailing list archives

From "Joo Seong (Jason) Jeong (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (REEF-1479) Define interface for distributed dataset
Date Tue, 12 Jul 2016 23:54:20 GMT

    [ https://issues.apache.org/jira/browse/REEF-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15373950#comment-15373950 ]

Joo Seong (Jason) Jeong edited comment on REEF-1479 at 7/12/16 11:54 PM:
-------------------------------------------------------------------------

[~markus.weimer], [~shravanmn], [~dkm2110], [~dss-2009@yandex.ru] and I talked offline about the following:
* Hide the Block notion altogether and make it an internal unit for potential 'block managers.' The Block interface is purely for the physical layer, and is confusing for upper layers, which may have their own 'blocks.'
* Expose some way of accessing the metadata of partitions to {{IDataSet}} users.

We still need more discussion on
* The delegation of REEF events from the REEF Driver to the MiniDriver.
* The actual block manager implementation.
* The output for operations and the scope of outputs in the pipeline.
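
One possible shape for the first two points above (Block hidden, partition metadata exposed) is sketched below. This is only an illustration, not the agreed design: {{PartitionMetadata}}, {{PartitionedDataSet}}, {{GetPartitionMetadata()}} and the fields on {{Block}} are all hypothetical names that do not exist in REEF.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical read-only metadata that dataset users are allowed to see.
public sealed class PartitionMetadata
{
    public PartitionMetadata(string partitionId, long itemCount)
    {
        PartitionId = partitionId;
        ItemCount = itemCount;
    }

    public string PartitionId { get; }
    public long ItemCount { get; }
}

// Hypothetical physical-layer unit. It is internal, so upper layers
// (which may have their own notion of 'blocks') never see it.
internal sealed class Block
{
    public Block(long itemCount) { ItemCount = itemCount; }
    public long ItemCount { get; }
}

// Hypothetical dataset handle: blocks stay private, metadata is public.
public sealed class PartitionedDataSet
{
    private readonly Dictionary<string, List<Block>> _blocksByPartition;

    internal PartitionedDataSet(Dictionary<string, List<Block>> blocksByPartition)
    {
        _blocksByPartition = blocksByPartition;
    }

    // Users get one summary per partition, never the blocks behind it.
    public IReadOnlyList<PartitionMetadata> GetPartitionMetadata()
    {
        return _blocksByPartition
            .Select(kv => new PartitionMetadata(kv.Key, kv.Value.Sum(b => b.ItemCount)))
            .ToList();
    }
}
```

With this split, a future block manager could change how blocks are laid out without affecting code that only reads the per-partition summaries.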


was (Author: jsjason):
[~markus.weimer], [~shravanmn], [~dkm2110], [~dss-2009@yandex.ru] and I talked offline about the following:
* Hide the Block notion altogether and make it an internal unit for potential 'block managers.' The Block interface is purely for the physical layer, and is confusing for upper layers, which may have their own 'blocks.'
* Expose some way of accessing the metadata of partitions to {{IDataSet}} users.

We still need more discussion on
* The delegation of REEF events from the REEF Driver to the MiniDriver.
* The actual block manager implementation.

> Define interface for distributed dataset 
> -----------------------------------------
>
>                 Key: REEF-1479
>                 URL: https://issues.apache.org/jira/browse/REEF-1479
>             Project: REEF
>          Issue Type: Sub-task
>          Components: REEF.NET
>            Reporter: Joo Seong (Jason) Jeong
>
> As a first step of [REEF-1477|https://issues.apache.org/jira/browse/REEF-1477], we'd
> like to define an interface for the distributed dataset that we will work with. This dataset
> interface serves as an abstraction over many dataset partitions, one on each Evaluator. The
> existing {{IPartitionedInputDataSet}} interface is very similar to what we want, except that
> the new interface will contain action methods like {{RunIMRU}} or {{RunTransform}}.
> {code}
> interface IDataSet<T> {
>   // Apply a partition-wise transform to this dataset.
>   // transformConf is shipped to each partition.
>   IDataSet<TOut> TransformPartitions<TOut>(IConfiguration transformConf);
>   // General interface for applying operations. Unlike
>   // TransformPartitions(), a stage is aware of all partitions.
>   IDataSet<TOut> RunStage<TOut>(IConfiguration stageConf);
>   // Fetch the actual data into the local process.
>   // May throw OutOfMemoryException if the data is too large.
>   T[] Collect();
> }
> {code}
> Writing the data to stable storage via a {{Store()}} method can be considered a special
> case of {{RunStage()}}. On the other hand, {{Load()}} must be defined in a separate interface/class
> and may depend on the backing filesystem.
> Other interfaces that allow 'Stage' implementations and partition access from Tasks must
> also be newly defined.
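
To make the difference between {{TransformPartitions()}} and {{RunStage()}} concrete, here is a minimal single-process sketch. It rests on loud assumptions: {{LocalDataSet}} is a hypothetical stand-in that is not part of REEF, partitions are plain in-memory lists rather than data on Evaluators, and delegates take the place of the {{IConfiguration}} arguments so that the code runs without REEF or Tang.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-in for the proposed IDataSet<T>.
// Delegates replace IConfiguration (an assumption made for this sketch).
public sealed class LocalDataSet<T>
{
    // One inner list per simulated partition (one per Evaluator in REEF).
    private readonly List<List<T>> _partitions;

    public LocalDataSet(IEnumerable<IEnumerable<T>> partitions)
    {
        _partitions = partitions.Select(p => p.ToList()).ToList();
    }

    public int PartitionCount => _partitions.Count;

    // Partition-wise: the same function runs on each partition independently,
    // so the number of partitions is preserved.
    public LocalDataSet<TOut> TransformPartitions<TOut>(
        Func<IEnumerable<T>, IEnumerable<TOut>> transform)
    {
        return new LocalDataSet<TOut>(_partitions.Select(p => transform(p)));
    }

    // Stage: the function sees all partitions at once and decides how the
    // output is partitioned.
    public LocalDataSet<TOut> RunStage<TOut>(
        Func<IReadOnlyList<IReadOnlyList<T>>, IEnumerable<IEnumerable<TOut>>> stage)
    {
        return new LocalDataSet<TOut>(stage(_partitions));
    }

    // Pull every element into a single local array.
    public T[] Collect() => _partitions.SelectMany(p => p).ToArray();
}
```

For example, doubling every element fits {{TransformPartitions()}} because each partition can be processed in isolation, while a global sum needs {{RunStage()}} because it must look across partitions. In this sketch a {{Store()}} operation would be one more stage function whose side effect is writing each partition out.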



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
