hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "CUDA On Hadoop" by ChenHe
Date Mon, 14 Mar 2011 16:12:14 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "CUDA On Hadoop" page has been changed by ChenHe.


  === For C/C++ programmers ===
+ We employ CUDA SDK programs in our experiments. For each CUDA SDK program, we first studied
the code and partitioned it into three portions: data generation, bootstrapping, and the
CUDA kernels. The first two components were transformed, respectively, into a standalone data
generator and a virtual method callable from the map method in our MapRed utility class. The
CUDA kernel is kept as-is, since we want to perform the same computation on the GPU, only in
a distributed fashion. The data generator is augmented to take command-line arguments, so
that we can specify input sizes and output locations for different experiment runs. The code
for bootstrapping a kernel execution is reused as part of the mapper workload, providing a
seamless integration of CUDA and Hadoop.
+ The architecture of the CUDA SDK programs as ported onto Hadoop is shown in Figure 2. For
reusability, we used an object-oriented design, abstracting the mapper and reducer functions
into a base class, MapRed. For a different computation, we override the following virtual
methods defined by MapRed:
+ • void processHadoopData(string& input);
+ • void cudaCompute(std::map<string,string>& output);
+ The processHadoopData method provides a hook for the CUDA program to initialize its internal
data structures by parsing the input passed from the Hadoop DFS. Thereafter, MapRed invokes
the cudaCompute method, in which the CUDA kernel is launched. The results of the computation
are stored in the output map and sent over to DFS for reduction.
+ With all the parts ready, we developed a set of scripts to launch the experiments. Specifically,
the scripts perform the following steps for all programs:
+ 1. Set up environment variables
+ 2. Generate input data
+ 3. Remove old data and upload new data onto Hadoop DFS
+ 4. Upload program binary onto Hadoop DFS (if changed)
+ 5. Remove the output directory from the previous run
+ 6. Submit the program as a MapReduce Job and start timer
+ 7. Report the runtime measurements and timestamps
+ Observing that the launch logic is very similar across programs (only the program names
and the arguments to the input generators differ), we developed a common driver script that
is parameterized by each individual launch script and performs all the processing accordingly.
This modularization lets us write the launch script for a new program in only a few lines
of code.
