hadoop-hdfs-user mailing list archives

From Amritanshu Shekhar <amritanshu.shek...@exponential.com>
Subject RE: how to read binary data from hdfs
Date Tue, 01 May 2012 13:15:56 GMT
Harsh,
Thanks for the input. Since my binary input file contains a fixed number of binary records
in a fixed format, wouldn't it be simpler to use FSDataInputStream to read the binary data
copied to HDFS as a byte array? I could simply copy a file containing the HDFS paths to
inputDir, and a map job would be invoked on each HDFS file. For example:

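    // Open the HDFS file with a 4096-byte read buffer.
    FSDataInputStream stm = fileSys.open(filename, 4096);
    byte[] actual = new byte[128];
    // readFully() fills the whole array; a plain read() may return fewer bytes.
    stm.readFully(actual, 0, actual.length);
    // Move to byte offset 4096 for any subsequent read.
    stm.seek(4096);
    stm.close();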
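
As a rough sketch of the byte-array approach described above, fixed-size records could be decoded as follows. It assumes a hypothetical 12-byte record (a little-endian int32 id followed by a float64 value); the actual layout written by the C program is not given in this thread. FSDataInputStream extends java.io.DataInputStream, so readAll should also accept a stream returned by FileSystem.open; an in-memory stream stands in for HDFS here so the example runs without a cluster. The class, field and method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical record layout, assumed for illustration only:
// int32 id followed by float64 value, both little-endian (12 bytes).
final class FixedRecordSketch {
    static final int RECORD_SIZE = 12;

    static final class Record {
        final int id;
        final double value;

        Record(int id, double value) {
            this.id = id;
            this.value = value;
        }
    }

    // Reads whole records until end of stream. Any DataInputStream works,
    // which includes an FSDataInputStream opened on an HDFS path.
    static List<Record> readAll(DataInputStream in) throws IOException {
        List<Record> records = new ArrayList<>();
        byte[] buf = new byte[RECORD_SIZE];
        while (true) {
            // Probe one byte so a clean end of stream can be told apart
            // from a record that is cut off part-way through.
            if (in.read(buf, 0, 1) < 0) {
                break;
            }
            // Throws EOFException if the last record is truncated.
            in.readFully(buf, 1, RECORD_SIZE - 1);
            ByteBuffer bb = ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN);
            records.add(new Record(bb.getInt(), bb.getDouble()));
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // Build two sample records in memory in place of an HDFS file.
        ByteBuffer bb = ByteBuffer.allocate(2 * RECORD_SIZE).order(ByteOrder.LITTLE_ENDIAN);
        bb.putInt(1).putDouble(2.5).putInt(2).putDouble(-4.0);
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bb.array()));
        for (Record r : readAll(in)) {
            System.out.println(r.id + " " + r.value);
        }
    }
}
```

ByteBuffer is used for decoding because DataInputStream.readInt() and readDouble() always assume big-endian order, whereas a C program on x86 would typically write little-endian values. If the file was written big-endian, the order(...) call can be dropped.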

Let me know if this approach would work and if a potentially better approach exists. I am
new to Hadoop so my question might seem too simplistic for some people.
Thanks,
Amritanshu

-----Original Message-----
From: Harsh J [mailto:harsh@cloudera.com] 
Sent: Tuesday, May 01, 2012 6:21 PM
To: hdfs-user@hadoop.apache.org
Cc: mlortiz@uci.cu
Subject: Re: how to read binary data from hdfs

Amritanshu,

Implement your own custom InputFormat with a RecordReader and you can
read your files directly.

To learn how to implement custom readers/formats you can refer to an
example provided via sub-title "Processing a whole file as a record",
Page 206 | Chapter 7: MapReduce Types and Formats in Tom White's
Hadoop: The Definitive Guide, or you can read up the details on
http://developer.yahoo.com/hadoop/tutorial/module5.html#inputformat.
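
One detail a custom RecordReader for fixed-size records has to settle, if the InputFormat allows a file to be split, is which split owns a record that straddles a split boundary. A common convention is that a record belongs to the split containing its first byte. The arithmetic might look like the sketch below; the class and method names are hypothetical and nothing here is taken from the Hadoop API.

```java
// Hypothetical helper showing how a fixed-length RecordReader could
// align a byte-range split to record boundaries.
final class SplitAlignmentSketch {

    // Offset of the first record that starts at or after splitStart.
    static long firstRecordOffset(long splitStart, int recordSize) {
        long rem = splitStart % recordSize;
        return rem == 0 ? splitStart : splitStart + (recordSize - rem);
    }

    // Number of records whose first byte lies inside
    // [splitStart, splitStart + splitLength), counting only records
    // that are complete within the file.
    static long recordsInSplit(long splitStart, long splitLength,
                               long fileLength, int recordSize) {
        long first = firstRecordOffset(splitStart, recordSize);
        long splitEnd = Math.min(splitStart + splitLength, fileLength);
        if (first >= splitEnd) {
            return 0;
        }
        // Ceiling division: a record starting before splitEnd is owned
        // by this split even if it ends beyond splitEnd.
        long startingHere = (splitEnd - first + recordSize - 1) / recordSize;
        long completeInFile = (fileLength - first) / recordSize;
        return Math.min(startingHere, completeInFile);
    }

    public static void main(String[] args) {
        int recordSize = 128;          // assumed record size
        long fileLength = 10L * 128;   // ten records
        long splitLength = 500;        // deliberately not a multiple of 128
        for (long start = 0; start < fileLength; start += splitLength) {
            System.out.println("split at " + start + ": first record at "
                + firstRecordOffset(start, recordSize) + ", records = "
                + recordsInSplit(start, splitLength, fileLength, recordSize));
        }
    }
}
```

With this convention each record is read by exactly one split, so the per-split counts add up to the number of records in the file. Returning false from isSplitable in the custom InputFormat, as in the whole-file example referenced above, sidesteps the question at the cost of one map task per file.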

On Tue, May 1, 2012 at 3:42 PM, Amritanshu Shekhar
<amritanshu.shekhar@exponential.com> wrote:
> Hi Guys,
> I want to read binary data (produced by a C program) that is copied to HDFS using a Java
> program. The idea is that I would eventually write a map-reduce job that would use the
> aforementioned program's output (the Java program would read the binary data and create a
> Java object which the map function would use). I read about the sequence file format that
> Hadoop supports, but converting the binary data into sequence file format using Java
> serialization would add another layer of complexity. Is there a simple, no-frills API that
> I can use to read binary data directly from HDFS? Any help/resources would be deeply appreciated.
> Thanks and Regards,
> Amritanshu



-- 
Harsh J
