Subject: Re: Parallelize a workflow using mapReduce
From: Bibek Paudel
Reply-To: eternalyouth@gmail.com
To: mapreduce-user@hadoop.apache.org
Date: Wed, 22 Jun 2011 17:00:55 +0200
In-Reply-To: <698229B4-1DC5-488F-8584-C108F03E2FEC@cern.ch>

On Wed, Jun 22, 2011 at 2:51 PM, Hassen Riahi wrote:
> Hi all,
>
> I'm looking to parallelize a workflow using mapReduce. The workflow can be
> summarized as follows:
>
> 1- Specify the list of paths of the binary files to process in a
> configuration file (let's call this configuration file CONFIG). These
> binary files are stored in HDFS. The list of paths can vary from 1 file to
> 10000* files.
> 2- Process the list of files given in CONFIG: this is done by calling a
> command (let's call it commandX) and giving CONFIG as an option, something
> like: commandX CONFIG. CommandX reads CONFIG and takes care of opening the
> files, processing them and then generating the output.
> 3- Merging... this step can be ignored for now.
>
> The only solutions that I'm seeing to port this workflow to mapReduce are:
>
> 1- Write map code which takes as input a list of paths and then calls
> commandX appropriately. But, AFAIK, the job will not be split and will run
> as a single mapReduce job over HDFS.
> 2- Read the input files, then take the output of the read operation and
> pass it as input to the map code. This solution implies a deeper and more
> complicated modification of commandX.
>
> Any ideas, comments or suggestions would be appreciated.

Hi,

If you are looking for a Hadoop-oriented solution to this, here is my
suggestion:

1. Create an HDFS directory with all your input files in it. If you don't
want to do this, create a JobConf object and add each input file to it
(maybe by reading your CONFIG).
2. Subclass FileInputFormat and return false from its isSplitable method:
this causes each input file to be processed by a single mapper. A rough
sketch follows below.

I hope I understood your problem properly, and that my suggestion is the
kind you were looking for.

Bibek
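In case a concrete example helps, here is a minimal sketch of what I mean,
using the old org.apache.hadoop.mapred API (since JobConf came up). The
class names, mapper and paths (WholeFileJob, NonSplittableTextInputFormat,
/user/you/inputs, ...) are placeholders I made up, and I reuse
TextInputFormat only to keep the sketch short; for your binary files you
would plug in an input format with an appropriate RecordReader.

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class WholeFileJob {

  // Refuse to split files: every input file becomes exactly one split,
  // and therefore exactly one map task.
  public static class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
      return false;
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WholeFileJob.class);
    conf.setJobName("one-mapper-per-file");

    conf.setInputFormat(NonSplittableTextInputFormat.class);
    // conf.setMapperClass(MyMapper.class); // a mapper that runs commandX on its file

    // Option 1: point the job at one HDFS directory holding all the inputs.
    FileInputFormat.setInputPaths(conf, new Path("/user/you/inputs"));

    // Option 2: add each path read from CONFIG individually, e.g.:
    // for (String p : pathsReadFromConfig) {
    //   FileInputFormat.addInputPath(conf, new Path(p));
    // }

    FileOutputFormat.setOutputPath(conf, new Path("/user/you/output"));
    JobClient.runJob(conf);
  }
}

With isSplitable returning false, Hadoop hands each file to a single mapper,
and that mapper can then invoke commandX on just its own file.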