hadoop-mapreduce-user mailing list archives

From: Hassen Riahi <hassen.ri...@cern.ch>
Subject: Parallelize a workflow using MapReduce
Date: Wed, 22 Jun 2011 12:51:02 GMT
Hi all,

I'm looking to parallelize a workflow using MapReduce. The workflow
can be summarized as follows:

1- Specify the list of paths of the binary files to process in a
configuration file (let's call this configuration file CONFIG). These
binary files are stored in HDFS. The list of paths can vary from 1
file to 10,000 files. (A hypothetical example of CONFIG is sketched
just after this list.)
2- Process the files listed in CONFIG: this is done by calling a
command (let's call it commandX) with CONFIG as an argument, something
like: commandX CONFIG. commandX reads CONFIG, opens the files,
processes them, and then generates the output.
3- Merging...this step can be ignored for now.
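
For concreteness, CONFIG might be nothing more than one HDFS path per
line; the exact format is an assumption here, and the paths below are
made up:

    /user/hassen/binaries/file-00001.bin
    /user/hassen/binaries/file-00002.bin
    ...

The whole sequential workflow is then a single invocation of
commandX CONFIG.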

The only solutions I can see for porting this workflow to MapReduce
are:

1- Write map code which takes a list of paths as input and then calls
commandX appropriately. But, AFAIK, the input will not be split, so
the whole list will be processed by a single map task (see the sketch
after this list for one way the splitting could be recovered).
2- Read the input files, and pass the output of the read operation as
input to the map code. This solution implies a deeper and more
complicated modification of commandX.
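
For what it's worth, here is a minimal sketch of how option 1 could be
made to split, assuming Hadoop's NLineInputFormat is used so that each
map task receives only a slice of the path list, and assuming commandX
can open the listed paths from the task node (fetching them locally
via the HDFS API is left out). The class and argument names are made
up; this is a sketch under those assumptions, not a tested
implementation:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Writer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class CommandXDriver {

  public static class CommandXMapper
      extends Mapper<LongWritable, Text, Text, Text> {

    private final StringBuilder paths = new StringBuilder();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx) {
      // Each input line is one path from the big list; collect this
      // task's share.
      paths.append(line.toString()).append('\n');
    }

    @Override
    protected void cleanup(Context ctx)
        throws IOException, InterruptedException {
      // Write a per-task CONFIG containing only the paths this task
      // received.
      File config = File.createTempFile("config-", ".txt");
      Writer w = new FileWriter(config);
      w.write(paths.toString());
      w.close();

      // Invoke commandX exactly as in the sequential workflow.
      Process p = new ProcessBuilder("commandX", config.getAbsolutePath())
          .redirectErrorStream(true).start();
      BufferedReader r =
          new BufferedReader(new InputStreamReader(p.getInputStream()));
      while (r.readLine() != null) {
        // Drain commandX's output so the process cannot block on a
        // full pipe.
      }
      int rc = p.waitFor();
      ctx.write(new Text(config.getName()), new Text("exit=" + rc));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "commandX-parallel");
    job.setJarByClass(CommandXDriver.class);
    job.setMapperClass(CommandXMapper.class);
    job.setNumReduceTasks(0);                        // map-only; merge later
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 100);  // 100 paths per map task
    NLineInputFormat.addInputPath(job, new Path(args[0]));   // the path list
    TextOutputFormat.setOutputPath(job, new Path(args[1]));  // job output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Whether 100 paths per split is sensible depends on how long commandX
takes per file; the job output here is just each task's exit code,
with commandX presumably writing its real output elsewhere (e.g. back
to HDFS), which is where the merging step would pick it up.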

Any ideas, comments or suggestions would be appreciated.

Thanks in advance for the help,
Hassen
