hadoop-common-user mailing list archives

From Arun C Murthy <ar...@yahoo-inc.com>
Subject Re: Performing exactly one map operation per file
Date Sun, 08 Apr 2007 11:25:21 GMT
Hi Albert,

On Sun, Apr 08, 2007 at 11:53:58AM +0200, Albert Strasheim wrote:
>Hello all
>I'm a new Hadoop user and I'm looking at using Hadoop for a distributed 
>machine learning application.

Welcome to Hadoop!

Here is a broad outline of how Hadoop's map-reduce framework handles user inputs and formats:
a) The user specifies the input directory via JobConf.setInputPath or mapred.input.dir in the
.xml file.
b) The user specifies the format of the input files so that the framework can decide how
to break the data into 'records', i.e. key/value pairs, which are then sent to the user-defined
map/reduce APIs. I suspect you will have to write your own InputFormat class (one per kind of
data: audio, image, video, etc.) by subclassing org.apache.hadoop.mapred.InputFormatBase,
along with an org.apache.hadoop.mapred.RecordReader (which actually reads the individual key/value
pairs). The org.apache.hadoop.mapred package has examples of both: TextInputFormat/LineRecordReader
and SequenceFileInputFormat/SequenceFileRecordReader. They usually come in pairs.
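
To make the idea of a 'record' concrete, here is a minimal sketch in plain Java. It is only a model and does not use Hadoop's classes: RecordSketch, lineRecords and wholeFileRecord are hypothetical names invented for illustration. It contrasts line-oriented records (offset as key, line as value, in the spirit of TextInputFormat/LineRecordReader) with a single record covering the whole file, which is what an image or audio format would likely want.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical model, NOT Hadoop's API: shows how the chosen format
// determines which key/value records a map function would receive.
final class RecordSketch {
    private RecordSketch() {}

    // Line-oriented reading: one record per line,
    // key = character offset of the line start, value = the line text.
    static List<Map.Entry<Long, String>> lineRecords(String content) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        if (content.isEmpty()) {
            return records;
        }
        long offset = 0;
        // split("\n") drops the empty string after a trailing newline.
        for (String line : content.split("\n")) {
            records.add(new SimpleImmutableEntry<>(offset, line));
            offset += line.length() + 1; // +1 for the newline separator
        }
        return records;
    }

    // Whole-file reading: exactly one record,
    // key = file name, value = the entire content.
    static List<Map.Entry<String, String>> wholeFileRecord(String name, String content) {
        return Collections.singletonList(new SimpleImmutableEntry<>(name, content));
    }
}
```

With the line-oriented model a two-line input produces two map calls; with the whole-file model any input produces exactly one.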

>From what I understood from running the sample programs, Hadoop splits up 
>input files and passes the pieces to the map operations. However, I can't 
>quite figure out how one would create a job configuration that maps a 
>single file at a time instead of splitting the file (which isn't what one 
>wants when dealing with images or audio).

InputFormatBase defines an 'isSplitable' API, which the framework uses to decide
whether to split up the input files. You can trivially turn splitting off
by returning 'false' from it in your {Audio|Video|Image}InputFormat classes.
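
As a rough illustration of what that flag changes, here is a small self-contained Java model. It is not Hadoop code: SplitSketch, Split and splitsFor are hypothetical names, and the fixed-size chunking is a simplification of whatever the framework really does. The only point is that a non-splitable file ends up as a single split, and therefore goes to a single map.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, NOT Hadoop's API: how an isSplitable-style flag
// affects the number of pieces (and hence map tasks) per input file.
final class SplitSketch {
    private SplitSketch() {}

    // One contiguous byte range of a file, handed to a single map task.
    static final class Split {
        final String file;
        final long start;
        final long length;

        Split(String file, long start, long length) {
            this.file = file;
            this.start = start;
            this.length = length;
        }
    }

    // Assumed simplification: splitable files are cut into fixed-size chunks;
    // non-splitable files always yield exactly one split covering the file.
    static List<Split> splitsFor(String file, long fileLength, long splitSize, boolean splitable) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Split> splits = new ArrayList<>();
        if (!splitable || fileLength <= splitSize) {
            splits.add(new Split(file, 0, fileLength));
            return splits;
        }
        for (long start = 0; start < fileLength; start += splitSize) {
            splits.add(new Split(file, start, Math.min(splitSize, fileLength - start)));
        }
        return splits;
    }
}
```

For a 250-byte file and a 100-byte split size, the splitable case gives three splits (100, 100 and 50 bytes), while the non-splitable case gives one split of 250 bytes.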

>- HadoopStreaming will be useful, since my algorithms can be implemented as 
>C++ or Python programs

The C++ map-reduce API that Owen has been working on might interest you: http://issues.apache.org/jira/browse/HADOOP-234.

