hadoop-user mailing list archives

From "Wheeler, Bill NPO" <bill.npo.whee...@intel.com>
Subject Using Hadoop infrastructure with input streams instead of key/value input
Date Mon, 03 Dec 2012 16:22:53 GMT
I am trying to use Hadoop's partitioning/scheduling/storage infrastructure to process many
HDFS files of data in parallel (1 HDFS file per map task), but in a way that does not naturally
fit into the key/value pair input framework.  Specifically my application's "map" function
equivalent does not want to receive formatted data as key/value pairs. Instead, I'd like to
receive a Hadoop input stream object for my map processing so that I can read bytes out in
many different ways with much greater flexibility and efficiency than what I'd get with the
key/value pair input constraint.  The input stream would handle the complexity of fetching
local and remote HDFS data blocks as needed on my behalf.  The result of the map processing
would then conform to key/value pair map outputs and be subsequently processed by traditional
reduce code.

I'm guessing that I am not the only person who would like to read HDFS file input directly
as this capability could open up new kinds of Hadoop usage models.  Is there any support for
acquiring input streams directly in Java map code?  And is there any support for doing the
same in C++ map code à la Pipes?

For added context, my application is in the video analytic space, requiring me to read video
files.  I have implemented a solution, but it is a hack with less than ideal characteristics:
I have RecordReader code which simply passes the HDFS filename through in the key field of my
key/value input.  I'm using Pipes to implement the map function in C++ code.  The C++ map
code then performs a system call, "hadoop fs -copyToLocal hdfs_filename local_filename" to
put the entire HDFS file on the datanode's local file system where it is readable by C++ IO
calls.  I then simply open up this file and process it.  It would be much better to avoid
having to do all the extra IO associated with "copyToLocal" and instead somehow receive an
input stream object that reads directly from HDFS.
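
For illustration only, here is a minimal sketch of one way the C++ side could skip the local copy: read the output of "hadoop fs -cat" through a pipe instead of running "copyToLocal" and reopening the file. It is not a Pipes or HDFS API. The function names stream_command_output and stream_hdfs_file are hypothetical, and the sketch assumes a POSIX system, the hadoop command on the task's PATH, and an HDFS filename containing no single quotes.

```cpp
#include <stdio.h>   // popen, pclose, fread (POSIX)
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <string>

// Callback that receives each chunk of bytes as it arrives.
using ChunkConsumer = std::function<void(const char*, std::size_t)>;

// Runs `command` through the shell and hands its standard output to
// `consume` chunk by chunk. Returns the total number of bytes read.
// Throws if the command cannot be started or exits with a nonzero status.
std::size_t stream_command_output(const std::string& command,
                                  const ChunkConsumer& consume) {
  FILE* pipe = popen(command.c_str(), "r");
  if (pipe == nullptr) {
    throw std::runtime_error("popen failed: " + command);
  }

  char buffer[64 * 1024];
  std::size_t total = 0;
  std::size_t n = 0;
  while ((n = fread(buffer, 1, sizeof buffer, pipe)) > 0) {
    consume(buffer, n);
    total += n;
  }

  if (pclose(pipe) != 0) {
    throw std::runtime_error("command failed: " + command);
  }
  return total;
}

// Hypothetical helper: streams one HDFS file via the `hadoop fs -cat` CLI.
// Assumption: `hdfs_filename` contains no single quotes, so wrapping it in
// single quotes is enough to protect it from the shell.
std::size_t stream_hdfs_file(const std::string& hdfs_filename,
                             const ChunkConsumer& consume) {
  return stream_command_output("hadoop fs -cat '" + hdfs_filename + "'",
                               consume);
}
```

The map code would call stream_hdfs_file with the filename received in the key field and decode bytes inside the callback. A pipe is strictly sequential, so this only helps if the video can be processed front to back; it does not give the seekable input stream asked about above.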

Any way of doing this in a more elegant fashion?

