hadoop-common-user mailing list archives

From "Albert Strasheim" <full...@gmail.com>
Subject Performing exactly one map operation per file
Date Sun, 08 Apr 2007 09:53:58 GMT
Hello all

I'm a new Hadoop user and I'm looking at using Hadoop for a distributed 
machine learning application.

For my application (and probably for many machine learning applications), one 
would want to do something like the following:

1. Upload a bunch of images/audio/whatever to the DFS
2. Run a map operation to do something like:
2.1 perform some transformation on each image, creating N new images
2.2 convert the audio into feature vectors, storing all the feature vectors 
from a single audio file in a new file
3. Store the output of these map operations in the DFS

In general, one wants to take a dataset of N discrete items and map them to 
N other items. Each item can typically be mapped independently of the others, 
so this distributes nicely. However, each item must be sent to the map 
operation as a single unit.

I've looked through the Hadoop wiki and the code and so far I've come up 
with the following:

- HadoopStreaming will be useful, since my algorithms can be implemented as 
C++ or Python programs
- I probably want to use an IdentityReducer to achieve what I outlined above
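To make the streaming idea concrete, here is a minimal sketch of what I understand the HadoopStreaming mapper contract to be: read records from stdin, write tab-separated key/value lines to stdout. The transformation here is just a placeholder (it emits the record's length), standing in for whatever feature extraction the real job would do:

```python
import sys

def map_line(line):
    # Treat each input line as one record and emit a tab-separated
    # key/value pair, which is the format HadoopStreaming expects.
    record = line.rstrip("\n")
    # Placeholder transformation; a real job would compute feature
    # vectors or transformed images here instead of a length.
    return "%s\t%d" % (record, len(record))

def main(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        stdout.write(map_line(line) + "\n")

if __name__ == "__main__":
    main()
```

With an IdentityReducer, these map output lines would be written to the DFS unchanged, which matches the no-reduce workflow outlined above.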

From what I understood from running the sample programs, Hadoop splits up 
input files and passes the pieces to the map operations. However, I can't 
quite figure out how one would create a job configuration that maps a single 
file at a time instead of splitting the file (which isn't what one wants 
when dealing with images or audio).
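One workaround I can think of (purely a sketch, not something I've found in the docs): instead of feeding the binary files to Hadoop directly, feed it a small text file listing one file path per line. Each path is then a single record, so each map call handles exactly one whole file. (If I read the code right, FileInputFormat also has an isSplitable() hook that a custom InputFormat could override to return false, so each file becomes exactly one split; that would be the Java-side approach.) A streaming-style mapper for the file-list trick might look like this, with the per-file byte count standing in for the real image/audio processing:

```python
import sys

def process_file(path):
    # Read the whole file as one unit and return a single summary
    # value (here, just its byte count). A real job would run the
    # image transformation or audio feature extraction instead.
    with open(path, "rb") as f:
        data = f.read()
    return len(data)

def main(stdin=sys.stdin, stdout=sys.stdout):
    # Each input record is a file path, so one map call processes
    # exactly one whole file -- no splitting of the binary data.
    for line in stdin:
        path = line.strip()
        if not path:
            continue
        stdout.write("%s\t%d\n" % (path, process_file(path)))

if __name__ == "__main__":
    main()
```

The output could then name the derived files written back to the DFS rather than a summary value, but the shape of the job would be the same.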

Does anybody have some ideas on how to accomplish this? I'm guessing some 
new code might have to be written, so any pointers on where to start would 
be much appreciated.

Thanks for your time.


