hadoop-mapreduce-user mailing list archives

From Hemanth Yamijala <yhema...@thoughtworks.com>
Subject Re: Using Hadoop infrastructure with input streams instead of key/value input
Date Tue, 04 Dec 2012 08:36:49 GMT

I have not tried this myself before, but would libhdfs help?
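
Something along these lines would let the map code read straight off HDFS
(an untested sketch; the file path is a placeholder, and
hdfsConnect("default", 0) assumes the Hadoop configuration is visible to
libhdfs):

// Untested sketch: read an HDFS file directly through the libhdfs C API.
// libhdfs fetches local and remote blocks behind hdfsRead, so no
// copyToLocal step is needed.
#include <fcntl.h>   // O_RDONLY
#include <stdio.h>
#include "hdfs.h"

int main() {
  hdfsFS fs = hdfsConnect("default", 0);  // "default" = configured namenode
  if (!fs) { fprintf(stderr, "hdfsConnect failed\n"); return 1; }

  // Placeholder path; the zeros mean default bufferSize/replication/blocksize.
  hdfsFile in = hdfsOpenFile(fs, "/user/bill/video.dat", O_RDONLY, 0, 0, 0);
  if (!in) { fprintf(stderr, "hdfsOpenFile failed\n"); return 1; }

  char buf[65536];
  tSize n;
  while ((n = hdfsRead(fs, in, buf, sizeof(buf))) > 0) {
    // ... consume n bytes at buf ...
  }

  hdfsCloseFile(fs, in);
  hdfsDisconnect(fs);
  return 0;
}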



On Mon, Dec 3, 2012 at 9:52 PM, Wheeler, Bill NPO <bill.npo.wheeler@intel.com> wrote:

> I am trying to use Hadoop’s partitioning/scheduling/storage
> infrastructure to process many HDFS files of data in parallel (one HDFS
> file per map task), but in a way that does not naturally fit into the
> key/value pair input framework. Specifically, my application’s “map”
> function equivalent does not want to receive formatted data as key/value
> pairs; instead, I’d like to receive a Hadoop input stream object for my
> map processing so that I can read bytes out in many different ways, with
> much greater flexibility and efficiency than what I’d get with the
> key/value pair input constraint. The input stream would handle the
> complexity of fetching local and remote HDFS data blocks as needed on my
> behalf. The result of the map processing would then conform to key/value
> pair map outputs and be subsequently processed by traditional reduce code.
>
> I’m guessing that I am not the only person who would like to read HDFS
> file input directly, as this capability could open up new types of Hadoop
> usage models. Is there any support for acquiring input streams directly
> into Java map code? And is there any support for doing the same in C++
> map code, à la Pipes?
>
> For added context, my application is in the video analytics space,
> requiring me to read video files. I have implemented a solution, but it
> is a hack with less-than-ideal characteristics: I have RecordReader code
> which simply passes the HDFS filename through in the key field of my
> key/value input. I’m using Pipes to implement the map function in C++
> code. The C++ map code then performs a system call, “hadoop fs
> -copyToLocal hdfs_filename local_filename”, to put the entire HDFS file
> on the datanode’s local file system, where it is readable by C++ IO
> calls. I then simply open up this file and process it. It would be much
> better to avoid all the extra IO associated with “copyToLocal” and
> instead somehow receive an input stream object from which to read
> directly from HDFS.
>
> Any way of doing this in a more elegant fashion?
>
> Thanks,
> Bill
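
For what it's worth, the copyToLocal step described above could be replaced
by opening the file with libhdfs inside the Pipes map itself, reusing the
filename-in-key trick already in place. A rough, untested sketch (the class
names and the final emit are illustrative, not a fixed recipe):

// Untested sketch: a Pipes mapper that reads the HDFS file named in the
// input key via libhdfs instead of shelling out to copyToLocal.
#include <fcntl.h>
#include <string>
#include "hdfs.h"
#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"

class VideoMapper : public HadoopPipes::Mapper {
public:
  VideoMapper(HadoopPipes::TaskContext& context) {}

  void map(HadoopPipes::MapContext& context) {
    // The custom RecordReader passes the HDFS filename through in the key.
    std::string path = context.getInputKey();

    hdfsFS fs = hdfsConnect("default", 0);
    hdfsFile in = hdfsOpenFile(fs, path.c_str(), O_RDONLY, 0, 0, 0);

    char buf[65536];
    tSize n;
    while ((n = hdfsRead(fs, in, buf, sizeof(buf))) > 0) {
      // ... feed n bytes to the video analysis instead of local file IO ...
    }

    hdfsCloseFile(fs, in);
    hdfsDisconnect(fs);

    // Emit whatever key/value results the analysis produces (placeholder).
    context.emit(path, "processed");
  }
};

// Pass-through reducer so the job runs with the standard Pipes wiring.
class PassThroughReducer : public HadoopPipes::Reducer {
public:
  PassThroughReducer(HadoopPipes::TaskContext& context) {}
  void reduce(HadoopPipes::ReduceContext& context) {
    while (context.nextValue()) {
      context.emit(context.getInputKey(), context.getInputValue());
    }
  }
};

int main() {
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<VideoMapper, PassThroughReducer>());
}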
