hadoop-hdfs-user mailing list archives

From Steve Lewis <lordjoe2...@gmail.com>
Subject Re: Using Hadoop infrastructure with input streams instead of key/value input
Date Mon, 03 Dec 2012 17:06:23 GMT
I presume a single file is handled by one and only one mapper. In that case
you can pass the path as a string and do something like this:

       public void map(Object key, Text value, Context context)
               throws IOException, InterruptedException {
           String hdfsPath = value.toString();
           final FileSystem fs = FileSystem.get(context.getConfiguration());
           Path src = new Path(hdfsPath);
           InputStream is = null;
           try {
               is = fs.open(src);
               // ... handle Stream
           } finally {
               if (is != null) {
                   is.close();
               }
           }
       }

      You might also try streaming to a C program.
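To get exactly one file path per mapper, the job above could be driven with NLineInputFormat over a text file that lists one HDFS path per line. A rough driver sketch — the class names (PathListDriver, StreamMapper) are my own invention, not from the thread:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class PathListDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "stream-per-file");
        job.setJarByClass(PathListDriver.class);
        // StreamMapper would be the mapper sketched above,
        // which opens each path with fs.open()
        job.setMapperClass(StreamMapper.class);
        // Each input split is one line of the path-list file,
        // so each mapper receives exactly one HDFS path
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

This is only a sketch of a job configuration that must run against a Hadoop cluster; the mapper still does the real work of opening the stream itself.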

On Mon, Dec 3, 2012 at 8:22 AM, Wheeler, Bill NPO <
bill.npo.wheeler@intel.com> wrote:

>  I am trying to use Hadoop’s partitioning/scheduling/storage
> infrastructure to process many HDFS files of data in parallel (1 HDFS file
> per map task), but in a way that does not naturally fit into the key/value
> pair input framework.  Specifically my application’s “map” function
> equivalent does not want to receive formatted data as key/value
> pairs—instead, I’d like to receive a Hadoop input stream object for my map
> processing so that I can read bytes out in many different ways with much
> greater flexibility and efficiency than what I’d get with the key/value
> pair input constraint.  The input stream would handle the complexity of
> fetching local and remote HDFS data blocks as needed on my behalf.  The
> result of the map processing would then conform to key/value pair map
> outputs and be subsequently processed by traditional reduce code.
>
> I’m guessing that I am not the only person who would like to read HDFS
> file input directly, as this capability could open up new types of Hadoop
> usage models.  Is there any support for acquiring input streams directly
> into Java map code?  And is there any support for doing the same into C++
> map code a la Pipes?
>
> For added context, my application is in the video analytics space,
> requiring me to read video files.  I have implemented a solution, but it
> is a hack with less than ideal characteristics:  I have RecordReader code
> which simply passes the HDFS filename through in the key field of my
> key/value input.  I’m using Pipes to implement the map function in C++
> code.  The C++ map code then performs a system call, “hadoop fs
> -copyToLocal hdfs_filename local_filename”, to put the entire HDFS file
> on the datanode’s local file system where it is readable by C++ IO calls.
> I then simply open up this file and process it.  It would be much better
> to avoid having to do all the extra IO associated with “copyToLocal” and
> instead somehow receive an input stream object from which to directly
> read from HDFS.
>
> Any way of doing this in a more elegant fashion?
>
> Thanks,
> Bill
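For the C++ side of the question above, one option is libhdfs, the C API that ships with Hadoop, which can open an HDFS file as a byte stream directly and avoid the copyToLocal round trip. A minimal sketch, assuming the libhdfs headers and a configured cluster are available; connection details and error handling are simplified:

```c
#include <fcntl.h>   /* O_RDONLY */
#include <stdio.h>
#include "hdfs.h"    /* libhdfs, shipped with Hadoop */

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <hdfs-path>\n", argv[0]);
        return 1;
    }

    /* Connect to the namenode named in the Hadoop configuration */
    hdfsFS fs = hdfsConnect("default", 0);
    if (!fs) { fprintf(stderr, "hdfsConnect failed\n"); return 1; }

    /* Open the HDFS file for reading (default buffer size,
       replication, and block size) */
    hdfsFile in = hdfsOpenFile(fs, argv[1], O_RDONLY, 0, 0, 0);
    if (!in) { fprintf(stderr, "hdfsOpenFile failed\n"); return 1; }

    char buf[65536];
    tSize n;
    while ((n = hdfsRead(fs, in, buf, sizeof buf)) > 0) {
        /* hand n bytes at buf to the video decoder here */
    }

    hdfsCloseFile(fs, in);
    hdfsDisconnect(fs);
    return 0;
}
```

The stream fetches local and remote HDFS blocks on the reader's behalf, which is the behavior the original question asks for; the same calls could be made from inside a Pipes C++ map function instead of a standalone program.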

Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
