From: Stefano Alberto Russo
Date: Tue, 11 Oct 2011 15:08:25 +0200
Subject: Binary executable with binary data on Hadoop MapReduce
Message-ID: <4E943FC9.9010506@cern.ch>

Hi all,

I'm trying to use Hadoop MapReduce (new API) in a particular way. I would like to make it work with an external executable that was not written for MapReduce (but is able to read from HDFS), and with binary input.

The idea is to store the binary input files on HDFS and then run a MapReduce job specifying these files as input. Once a map task has landed on a node, I would like to stop it from reading the input data itself and instead have it spawn the precompiled executable, which loads the input data from HDFS on its own. This way, the MapReduce framework will already have placed the mapper, and consequently the spawned binary, as close as possible to the data.

I do not want to run a reduce phase. The aggregation (very fast for my computational problem) will be done by a simple script that downloads the outputs from HDFS (previously uploaded there by the spawned binary).

I made some tests:

I can obtain the file currently being "analyzed" by the mapper (to pass it to the spawned binary) using:

    Configuration conf = context.getConfiguration();
    FileSplit fileSplit = (FileSplit) context.getInputSplit();
    String sFileName = fileSplit.getPath().toString();

I can prevent the binary input files from being split by overriding the "isSplitable" function in a new InputFormat (about performance: files will usually be smaller than the block size).

But I don't know how to stop the map task from reading its input file. I was thinking about something like defining a new RecordReader whose records are delimited by the end of file, so that each file becomes a single record and I can spawn the binary in the map() function of the mapper. But will this cause the entire file to be loaded into memory?

Or, is there a way to tell the MapReduce framework not to feed the map task automatically by pushing records to it, but instead to wait for the map task to pull them (and then never pull)?
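For the spawning step itself, this is roughly what I have in mind: a minimal sketch that uses only the standard java.lang.ProcessBuilder. The class name BinaryLauncher, the argument order (executable first, then the HDFS URI) and the log file are hypothetical, chosen for illustration; the input URI would be the sFileName obtained above.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of Hadoop) that a map task could call once per input file. */
final class BinaryLauncher {

    private BinaryLauncher() {}

    /**
     * Builds the argument vector for the external binary.
     * Assumption: the binary takes the HDFS URI of its input as first argument.
     */
    static List<String> buildCommand(String executable, String inputUri, String... extraArgs) {
        List<String> command = new ArrayList<>();
        command.add(executable);
        command.add(inputUri);
        command.addAll(Arrays.asList(extraArgs));
        return command;
    }

    /**
     * Starts the command, appends its stdout and stderr to logFile,
     * waits for it to finish and returns its exit code.
     */
    static int run(List<String> command, File logFile)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Merge stderr into stdout and send both to a file, so the child
        // never blocks on a full pipe that nobody is reading.
        builder.redirectErrorStream(true);
        builder.redirectOutput(ProcessBuilder.Redirect.appendTo(logFile));
        Process process = builder.start();
        return process.waitFor();
    }
}
```

In map() I would then call something like BinaryLauncher.run(BinaryLauncher.buildCommand(exePath, sFileName), new File("analysis.log")) and fail the task if the exit code is not zero (exePath and the log file name are placeholders).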
Any help is appreciated!

Thank you,
Stefano.