Subject: Re: Parallelize a workflow using mapReduce
From: Bibek Paudel
Reply-To: eternalyouth@gmail.com
To: mapreduce-user@hadoop.apache.org
Date: Wed, 22 Jun 2011 17:00:55 +0200
In-Reply-To: <698229B4-1DC5-488F-8584-C108F03E2FEC@cern.ch>

On Wed, Jun 22, 2011 at 2:51 PM, Hassen Riahi wrote:
> Hi all,
>
> I'm looking to parallelize a workflow using mapReduce. The workflow can be
> summarized as follows:
>
> 1- Specify the list of paths of the binary files to process in a
> configuration file (let's call this configuration file CONFIG). These
> binary files are stored in HDFS. The list of paths can vary from 1 file to
> 10000* files.
> 2- Process the list of files given in CONFIG: this is done by calling a
> command (let's call it commandX) and giving CONFIG as an option, something
> like: commandX CONFIG. CommandX reads CONFIG and takes care of opening the
> files, processing them and then generating the output.
> 3- Merging... this step can be ignored for now.
>
> The only solutions that I'm seeing to port this workflow to mapReduce are:
>
> 1- Write map code which takes as input a list of paths and then calls
> commandX appropriately. But, AFAIK, the job will not be split and will run
> as a single mapReduce job over HDFS.
> 2- Read the input files, then take the output of the read operation and
> pass it as input to the map code. This solution implies a deeper and more
> complicated modification of commandX.
>
> Any ideas, comments or suggestions would be appreciated.

Hi,

If you are looking for a Hadoop-oriented solution to this, here is my
suggestion:

1. Create an HDFS directory with all your input files in it. If you don't
want to do this, create a JobConf object and add each input file to it
(maybe by reading your CONFIG).
2. Subclass FileInputFormat and return false from its isSplitable method:
this causes each input file to be processed by a single mapper. A rough
sketch follows below.

I hope I understood your problem properly, and that my suggestion is the
kind you were looking for.

Bibek
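In case a concrete example helps, here is a minimal sketch of what I mean,
using the old org.apache.hadoop.mapred API (since JobConf came up). The
class names, mapper and paths (WholeFileJob, NonSplittableTextInputFormat,
/user/you/inputs, ...) are placeholders I made up, and I reuse
TextInputFormat only to keep the sketch short; for your binary files you
would plug in an input format with an appropriate RecordReader.

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class WholeFileJob {

  // Refuse to split files: every input file becomes exactly one split,
  // and therefore exactly one map task.
  public static class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
      return false;
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WholeFileJob.class);
    conf.setJobName("one-mapper-per-file");

    conf.setInputFormat(NonSplittableTextInputFormat.class);
    // conf.setMapperClass(MyMapper.class); // a mapper that runs commandX on its file

    // Option 1: point the job at one HDFS directory holding all the inputs.
    FileInputFormat.setInputPaths(conf, new Path("/user/you/inputs"));

    // Option 2: add each path read from CONFIG individually, e.g.:
    // for (String p : pathsReadFromConfig) {
    //   FileInputFormat.addInputPath(conf, new Path(p));
    // }

    FileOutputFormat.setOutputPath(conf, new Path("/user/you/output"));
    JobClient.runJob(conf);
  }
}

With isSplitable returning false, Hadoop hands each file to a single mapper,
and that mapper can then invoke commandX on just its own file.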