hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sanjay Dahiya <sanj...@yahoo-inc.com>
Subject MapReduce C++ API
Date Tue, 16 May 2006 12:00:26 GMT
Hi -

I am working on the C++ API for MapReduce. Following are a few ways  
to implement it, I am looking for feedback on how
users may want to use this API, which will impact how we should  
design it.

For Mapper and Reducer its fairly straightforward to provide a Java  
Class which acts as a proxy to channel calls to C++
code using JNI. The problem is with the data IO to Map and reduce  
functions. The complexity depends on how much
flexibility do we want to provide to the C++ code in terms of  
defining the data format.

1. Allow Users of API to define custom data format for records  
consumed/produced by Map and Reduce. It can be
implemented in 2 ways

a) Make MapReduce(java) framework aware of the data format by  
exporting IO formats and reading/writing to a native
implementation. In this case effectively all(most) of the classes  
involved in data read/write will have a proxy (native)
implementation which could be configured in jobcof. Implementation of  
native methods will be in a typical JNI lib. In
this case we can avoid moving data(explicitely by serializing)  
between C++ and Java code. This will increase the
complexity on user's part but gives more flexibility in defining data  
formats. In this case user will not need to write any Java code.

b) Use explicit serialization between java and C++ code using Record  
IO. In this case user defines the data format with
Record IO and generates C++/Java classes. Also user need to write  
wrappers for the generated classes which will
implement Writable/WriteableComparable. In this case the IO classes  
will not have native implementation and only
Mapper/Reducer need a native implementation with some helper classes  
which can serialize/deserialize data. User will
need knowledge of both Java and C++ (and Record IO).

2. User use a single/standard data formats.

a) MapReduce(java) framework is not aware of the actual record  
format. It deals with a single format which could be any
one of the available formats or strings. All interchange with C++  
through JNI is done in the form of strings, C++ code
can convert this raw data in any structure using JUTE or any other  
data binding tool. After processing the data C++
code will return the standard object back to Java. In this case if  
users are using the same record format then they
need to define only Map and Reduce functions, rest is taken care of  
by the framework using helper methods, otherwise
they also need to take care of serializing and deserializing data.

b) Map and reduce work on string records only. Export Map and Reduce  
methods to native code with a Java proxy. Users
need only implement these two methods and link the library. Rest is  
taken care of by the framework.


View raw message