hadoop-common-dev mailing list archives

From "Jaliya Ekanayake" <jnekanay...@gmail.com>
Subject Re: Using Hadoop with executables and binary data
Date Thu, 20 Aug 2009 17:29:55 GMT
Hi Stefan,

 

I am sorry for the late reply; somehow the response email slipped past my
eyes.

Could you explain a bit about how to use Hadoop Streaming with binary data
formats?

I can see explanations on using it with text data formats, but not for
binary files.


Thank you,

Jaliya

Stefan Podkowinski
Mon, 10 Aug 2009 01:40:05 -0700

Jaliya,
 
did you consider Hadoop Streaming for your case?
http://wiki.apache.org/hadoop/HadoopStreaming
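Streaming passes records between Hadoop and the mapper as text lines on stdin/stdout, so one common pattern for binary data is to stream the *file paths* as text and let the mapper do the binary I/O itself (newer streaming releases also offer a typedbytes mode for binary records). A minimal sketch of such a mapper, assuming a placeholder executable `prog.exe` and output directory that are not from this thread:

```python
#!/usr/bin/env python3
# Hypothetical streaming mapper (a sketch, not code from this thread).
# Hadoop Streaming delivers input as text lines on stdin, so we stream
# file *paths* as text and open the binary files inside the mapper.
import os
import subprocess
import sys

def map_lines(lines, prog="prog.exe", out_dir="output"):
    """For each input line (a path to a binary input file), run the
    external program on it and yield (input_path, output_path) pairs.
    `prog` and `out_dir` are illustrative defaults, not real names."""
    for line in lines:
        in_path = line.strip()
        if not in_path:
            continue
        out_path = os.path.join(out_dir, os.path.basename(in_path) + ".out")
        # Step the thread describes: invoke the executable per input file.
        subprocess.run([prog, in_path, out_path], check=True)
        yield in_path, out_path

if __name__ == "__main__":
    # Emit tab-separated key/value pairs, as streaming expects.
    for key, value in map_lines(sys.stdin):
        print("%s\t%s" % (key, value))
```

The map output stays pure text (path pairs), so streaming's line-oriented framing is never asked to carry raw bytes.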
 
 
On Wed, Jul 29, 2009 at 8:35 AM, Jaliya Ekanayake <jekan...@cs.indiana.edu> wrote:
> Dear Hadoop devs,
> 
> 
> 
> Please help me to figure out a way to program the following problem using
> Hadoop.
> 
> I have a program which I need to invoke in parallel using Hadoop. The
> program takes an input file (binary) and produces an output file (binary):
> 
> 
> 
> input.bin -> prog.exe -> output.bin
> 
> 
> 
> The input data set is about 1 TB in size. Each input data file is about
> 33 MB in size, so I have about 31,000 files.
> 
> Each output binary file is about 9 KB in size.
> 
> 
> 
> I have implemented this program using Hadoop in the following way.
> 
> 
> 
> I keep the input data in a shared parallel file system (Lustre File
> System).
> 
> Then, I collect the input file names and write them to a collection of
> files in HDFS (let's say hdfs_input_0.txt ..).
> 
> Each hdfs_input file contains roughly an equal number of URIs to the
> original input files.
> 
> The map task simply takes a string value, which is a URI to an original
> input data file, and executes the program as an external program.
> 
> The output of the program is also written to the shared file system
> (Lustre File System).
> 
> 
> 
> The problem with this approach is that I am not utilizing a key benefit of
> MapReduce: the use of local disks.
> 
> Could you please suggest a way to use local disks for the above problem?
> 
> 
> 
> I thought of the following way, but would like to check with you whether
> there is a better way.
> 
> 
> 
> 1. Upload the original data files to HDFS.
> 
> 2. In the map task, read the data file as a binary object.
> 
> 3. Save it to the local file system.
> 
> 4. Call the executable.
> 
> 5. Push the output from the local file system to HDFS.
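The five steps above could be sketched roughly as follows. This is a sketch under assumptions: `fetch` and `push` stand in for whatever HDFS transfer mechanism is used (e.g. `hadoop fs -get` / `hadoop fs -put`, or the HDFS API), and `prog` is the external executable; none of these names are from the original message.

```python
import os
import subprocess
import tempfile

def process_one(fetch, push, input_uri, prog, output_uri):
    """Sketch of steps 2-5: copy one input file from HDFS to local
    disk, run the external executable on it, and push the result
    back to HDFS. `fetch(src, dst)` and `push(src, dst)` are
    stand-ins for the actual HDFS transfer calls."""
    with tempfile.TemporaryDirectory() as tmp:
        local_in = os.path.join(tmp, "input.bin")
        local_out = os.path.join(tmp, "output.bin")
        fetch(input_uri, local_in)   # steps 2-3: HDFS -> local disk
        # Step 4: invoke the executable on the local copy.
        subprocess.run([prog, local_in, local_out], check=True)
        push(local_out, output_uri)  # step 5: local disk -> HDFS
```

With the inputs uploaded to HDFS (step 1), Hadoop can schedule each map task near its input data, so the fetch in steps 2-3 becomes a mostly local read instead of traffic to the shared file system.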
> 
> 
> 
> Any suggestion is greatly appreciated.
> 
> 
> Thank you,
> 
> Jaliya
> 

 

