reef-dev mailing list archives

From "Markus Weimer (JIRA)" <>
Subject [jira] [Created] (REEF-1247) Add compatibility to Hadoop InputFormats to REEF.NET
Date Fri, 11 Mar 2016 19:13:39 GMT
Markus Weimer created REEF-1247:

             Summary: Add compatibility to Hadoop InputFormats to REEF.NET
                 Key: REEF-1247
             Project: REEF
          Issue Type: New Feature
          Components: REEF.NET, REEF.NET IO
            Reporter: Markus Weimer

h1. Problem

Data access on Hadoop clusters is mediated through the {{InputFormat}}
interface and related code ({{RecordReader}}, {{InputSplit}}, ...). It provides access to
records stored in a partitioned data source and is thus the equivalent of the {{IPartitionedDataSet}}
interface in REEF.NET.
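To make the shared contract concrete, here is a minimal sketch of the split/read pattern both interfaces follow. The classes below are simplified stand-ins written for illustration, not the real {{org.apache.hadoop.mapreduce}} or REEF.NET APIs: a data source is described as a list of splits, and each split can be read independently.

```java
import java.util.ArrayList;
import java.util.List;

public class InputFormatSketch {

  /** Describes one partition of the data source (cf. Hadoop's InputSplit). */
  static final class Split {
    final long start;
    final long length;

    Split(final long start, final long length) {
      this.start = start;
      this.length = length;
    }
  }

  /**
   * Computes fixed-size splits over a data source of the given total size,
   * analogous to what an InputFormat's getSplits() does.
   */
  static List<Split> getSplits(final long totalBytes, final long splitBytes) {
    final List<Split> splits = new ArrayList<>();
    for (long offset = 0; offset < totalBytes; offset += splitBytes) {
      splits.add(new Split(offset, Math.min(splitBytes, totalBytes - offset)));
    }
    return splits;
  }

  public static void main(final String[] args) {
    // A 10-byte source with 4-byte splits yields partitions [0,4), [4,8), [8,10).
    for (final Split s : getSplits(10, 4)) {
      System.out.println("split start=" + s.start + " length=" + s.length);
    }
  }
}
```

The key property is that split enumeration is separate from record reading, which is what lets the two halves run in different processes (and, here, different runtimes).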

The Hadoop ecosystem provides a rich set of {{InputFormat}} implementations to Java and JVM-based
languages. At the same time, .NET developers need access to all the data stored in those
formats.
The traditional way to do so is something like Hadoop Streaming,
which allows individual Map and Reduce implementations to be provided as separate executables
spawned by the Java Map and Reduce implementations. Data is funneled to and from those side
processes via {{STDIN}} and {{STDOUT}} on a per-record basis. This approach is undesirable
for REEF.NET for a number of reasons:

  # Funneling each record across the process boundary is slow.
  # REEF does not expose a programming model. Hence, there is no equivalent of a "Map" to
implement in C#.
  # Managing two processes in the same container is error-prone, as they need to coordinate
resource use.

Hence, we need a more generic solution that effectively bridges the {{InputFormat}} (Java)
and {{IPartitionedDataSet}} (.NET) worlds.    
h1. Proposed solution

{{IPartitionedDataSet}} and {{InputFormat}} solve similar problems on both the Driver and
the Evaluator side: on the Driver, {{IPartitionedDataSet}} provides the {{PartitionDescriptor}}
instances used to configure the {{IPartition}} readers for the Evaluators, while {{InputFormat}}
provides the {{InputSplit}}s and the configuration needed to instantiate a {{RecordReader}}
on the Evaluator side. This allows us to implement {{IPartitionedDataSet}} using {{InputFormat}}:

h2. Driver side

On the Driver, we need an {{IPartitionedDataSet}} implementation that
  # accepts an input specification,
  # launches an external Java program that uses {{InputFormat}} to generate the {{InputSplit}}s, and
  # collects those {{InputSplit}}s into {{PartitionDescriptor}}s.
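Step 3 above could look roughly like the sketch below: the Driver consumes the split listing emitted by the external Java helper and turns each line into a partition descriptor. The {{path:start:length}} line format and the {{PartitionDescriptor}} stand-in are assumptions made for illustration; they are not existing REEF APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitCollector {

  /** A hypothetical stand-in for REEF.NET's PartitionDescriptor. */
  static final class PartitionDescriptor {
    final String path;
    final long start;
    final long length;

    PartitionDescriptor(final String path, final long start, final long length) {
      this.path = path;
      this.start = start;
      this.length = length;
    }
  }

  /** Turns one "path:start:length" line of the helper's output into a descriptor. */
  static PartitionDescriptor parse(final String line) {
    final String[] parts = line.split(":");
    return new PartitionDescriptor(
        parts[0], Long.parseLong(parts[1]), Long.parseLong(parts[2]));
  }

  public static void main(final String[] args) {
    // Output as the split-generating helper process might print it.
    final List<String> helperOutput =
        Arrays.asList("input/part-0:0:1024", "input/part-0:1024:512");
    final List<PartitionDescriptor> descriptors = new ArrayList<>();
    for (final String line : helperOutput) {
      descriptors.add(parse(line));
    }
    System.out.println("collected " + descriptors.size() + " partitions");
  }
}
```

In the real implementation, the split description would presumably need to carry the full serialized {{InputSplit}} plus the Hadoop configuration, since the Evaluator-side helper must be able to reconstruct the split.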

h2. Evaluator side

On the Evaluator, we need an {{IPartition}} implementation that receives the {{InputSplit}}
definition from the Driver. It then forks a Java program to download that split into a temp
file on the local file system of the container. The {{IPartition}} implementation then reads
that file and returns it to its clients.
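The Evaluator-side hand-off can be sketched as two halves of a file-based exchange. This is a simplified stand-in, not the proposed implementation: the forked Java helper (here {{downloadSplit}}) materializes the split's records into a temp file, and the partition reader (here {{readPartition}}) serves that local copy to its clients.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class PartitionDownloadSketch {

  /** Stands in for the forked Java helper: writes the split's records to a temp file. */
  static Path downloadSplit(final List<String> records) {
    try {
      final Path tempFile = Files.createTempFile("split-", ".tmp");
      Files.write(tempFile, records);
      return tempFile;
    } catch (final IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Stands in for the IPartition implementation: reads the local copy back. */
  static List<String> readPartition(final Path tempFile) {
    try {
      return Files.readAllLines(tempFile);
    } catch (final IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(final String[] args) {
    final Path localCopy = downloadSplit(List.of("record-1", "record-2"));
    System.out.println("read " + readPartition(localCopy).size() + " records");
  }
}
```

Because the exchange goes through a whole file rather than a per-record pipe, the helper process can exit as soon as the download finishes, which is what keeps the two-process phase short-lived.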

h1. Discussion

The solution above addresses two of the three issues with the streaming-based approach:

  # It downloads whole partitions, thereby skipping the overhead of funneling individual records
across the process boundary.
  # This approach is programming model neutral.

However, it still relies on external Java processes, which means that it still suffers from
the downsides of having multiple processes in a container. On the plus side, those Java processes
are relatively short-lived and simple, so their memory footprint should be controllable.

This message was sent by Atlassian JIRA
