reef-dev mailing list archives

From "Markus Weimer (JIRA)" <>
Subject [jira] [Commented] (REEF-1247) Add compatibility to Hadoop InputFormats to REEF.NET
Date Fri, 18 Mar 2016 21:04:33 GMT


Markus Weimer commented on REEF-1247:

Thanks, [~motus], for picking this up! A first step is probably to split the work into smaller tasks.

> Add compatibility to Hadoop InputFormats to REEF.NET
> ----------------------------------------------------
>                 Key: REEF-1247
>                 URL:
>             Project: REEF
>          Issue Type: New Feature
>          Components: REEF.NET, REEF.NET IO
>            Reporter: Markus Weimer
>            Assignee: Sergiy Matusevych
> h1. Problem
> Data access on Hadoop clusters is mediated through the {{[InputFormat|]}}
interface and related code ({{RecordReader}}, {{InputSplit}}, ...). It allows for access to
records stored in a partitioned data source. It is the equivalent of the {{IPartitionedDataSet}}
interface in REEF.NET. 
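For the .NET-facing reader, the split-computation half of the {{InputFormat}} contract can be sketched as follows. This is a simplified, hypothetical stand-in for {{org.apache.hadoop.mapreduce.InputFormat}} and {{InputSplit}}, not the real Hadoop classes; it only illustrates how an input is carved into byte-range splits.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for Hadoop's InputSplit: a byte range of one file.
class FileSplit {
    final String path;
    final long start;
    final long length;

    FileSplit(String path, long start, long length) {
        this.path = path;
        this.start = start;
        this.length = length;
    }
}

// Stand-in for an InputFormat implementation. The real
// InputFormat#getSplits(JobContext) consults the file system and
// block locations; here we simply cut the file into fixed-size ranges.
class SimpleTextInputFormat {
    List<FileSplit> getSplits(String path, long fileLength, long splitSize) {
        List<FileSplit> splits = new ArrayList<>();
        for (long off = 0; off < fileLength; off += splitSize) {
            splits.add(new FileSplit(path, off, Math.min(splitSize, fileLength - off)));
        }
        return splits;
    }
}
```

A {{RecordReader}} would then iterate the records inside one such split; that half is omitted here.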
> The Hadoop ecosystem provides a rich set of {{InputFormat}} implementations to Java and
JVM based languages. At the same time, .NET developers need access to all the data stored
in those formats.
> The traditional way to do so is something like [Hadoop Streaming|],
which allows individual Map and Reduce implementations to be provided as separate executables
spawned by the Java Map and Reduce implementations. Data is funneled to and from those side
processes via {{STDIN}} and {{STDOUT}} on a per-record basis. This approach is undesirable
for REEF.NET for a number of reasons:
>   # Funneling each record across the process boundary is slow.
>   # REEF does not expose a programming model. Hence, there is no equivalent of a "Map"
to implement in C#.
>   # Managing two processes in the same container is error-prone, as they need to coordinate
resource use.
> Hence, we need a more generic solution that effectively bridges the {{InputFormat}} (Java)
and {{IPartitionedDataSet}} (.NET) worlds.    
> h1. Proposed solution
> {{IPartitionedDataSet}} and {{InputFormat}} solve similar problems on both the Driver
and the Evaluator side: On the Driver side, {{IPartitionedDataSet}} provides {{PartitionDescriptor}}
instances used to configure the {{IPartition}} readers for the Evaluators. {{InputFormat}}
provides the {{InputSplit}}s and the configuration needed to instantiate a {{RecordReader}}
on the Evaluator side. This allows us to implement {{IPartitionedDataSet}} using {{InputFormat}}:
> h2. Driver side
> On the Driver, we need an {{IPartitionedDataSet}} implementation that
>   # accepts an input specification, 
>   # launches an external Java program that uses {{InputFormat}} to generate the {{InputSplit}}s
>   # then collects the {{InputSplit}}s into {{PartitionDescriptor}}s.
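The three Driver-side steps above can be sketched as follows. The names ({{InputFormatPartitionedDataSet}}, {{fromHelperOutput}}, {{PartitionDescriptor}} as a plain class) are hypothetical, and the fork/exec of the helper Java program is elided: the sketch starts from the helper's output, assuming it prints one serialized {{InputSplit}} per line.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the descriptor handed to each Evaluator's IPartition.
class PartitionDescriptor {
    final String id;
    final String serializedSplit;

    PartitionDescriptor(String id, String serializedSplit) {
        this.id = id;
        this.serializedSplit = serializedSplit;
    }
}

class InputFormatPartitionedDataSet {
    // In the proposed design, the Driver would fork something like
    // `java InputFormatHelper <input spec>` and read its stdout.
    // Here we take that output directly and wrap each serialized
    // InputSplit in a PartitionDescriptor.
    List<PartitionDescriptor> fromHelperOutput(List<String> splitLines) {
        List<PartitionDescriptor> descriptors = new ArrayList<>();
        for (int i = 0; i < splitLines.size(); i++) {
            descriptors.add(new PartitionDescriptor("partition-" + i, splitLines.get(i)));
        }
        return descriptors;
    }
}
```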
> h2. Evaluator side
> On the Evaluator, we need an {{IPartition}} implementation that receives the {{InputSplit}}
definition from the Driver. It then forks a Java program to download that split into a temp
file on the local file system of the container. The {{IPartition}} implementation then reads
that file and returns it to its clients.
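The Evaluator-side flow can be sketched like this. {{InputFormatPartition}} and its methods are hypothetical names, and the forked Java helper is again elided: the sketch stands in for it by writing the split's bytes to the temp file itself.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class InputFormatPartition {
    private final String serializedSplit;

    InputFormatPartition(String serializedSplit) {
        this.serializedSplit = serializedSplit;
    }

    // In the proposed design this step forks something like
    // `java SplitDownloader <serialized split>` to materialize the
    // split on local disk; here we write the bytes ourselves.
    Path downloadToTempFile(byte[] splitContents) throws IOException {
        Path tmp = Files.createTempFile("input-split-", ".bin");
        Files.write(tmp, splitContents);
        return tmp;
    }

    // Serve the materialized partition back to the client.
    byte[] read(Path tmp) throws IOException {
        return Files.readAllBytes(tmp);
    }
}
```

Because the helper downloads the whole split at once, the per-record process-boundary crossing of the streaming approach disappears.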
> h1. Discussion
> The solution above addresses 2 of the 3 issues with the streaming-based approach:
>   # It downloads whole partitions, thereby skipping the overhead of funneling individual
records across the process boundary.
>   # This approach is programming model neutral.
> However, it still relies on external Java processes and therefore retains the downsides
of having multiple processes in a container. On the plus side, those Java processes are
relatively short-lived and simple, so their memory footprint should be controllable.

This message was sent by Atlassian JIRA
