hadoop-user mailing list archives

From Jason Wang <jason.j.w...@gmail.com>
Subject hadoop streaming with custom RecordReader class
Date Thu, 18 Oct 2012 04:02:43 GMT
Hi all,
I'm experimenting with hadoop streaming on build 1.0.3.

To give background info, I'm streaming a text file into a mapper written in
C.  Using the default settings, streaming uses TextInputFormat, which
creates one record from each line.  The problem I am having is that I need
record boundaries to fall every 4 lines.  When the input is split across
the mappers, I get partial records at the split boundaries because of
this.  To address this, my approach was to write a new RecordReader class
in Java that is almost identical to LineRecordReader, but with a
modified next() method that reads 4 lines instead of one.
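
The grouping that the modified next() has to do, joining every four physical lines into one logical record, can be sketched in plain Java outside the Hadoop API (the class and method names below are just illustrative, not from my actual reader):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FourLineRecords {
    // Group every `n` input lines into one logical record, joined by '\n'.
    // A trailing partial group is emitted as-is rather than dropped.
    public static List<String> groupLines(List<String> lines, int n) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += n) {
            int end = Math.min(i + n, lines.size());
            records.add(String.join("\n", lines.subList(i, end)));
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a", "b", "c", "d", "e", "f");
        for (String rec : groupLines(lines, 4)) {
            System.out.println("record: " + rec.replace("\n", "\\n"));
        }
    }
}
```

In the actual RecordReader the same loop would instead call the wrapped LineRecordReader's next() up to four times per record, so split handling stays with the underlying line reader.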

I then compiled the new class and created a jar.  I wanted to import this
at run time using the -libjars argument, like such:

hadoop jar ../contrib/streaming/hadoop-streaming-1.0.3.jar -libjars
NLineRecordReader.jar -files test_stream.sh -inputreader
mypackage.NLineRecordReader -input /Users/hadoop/test/test.txt -output
/Users/hadoop/test/output -mapper "test_stream.sh" -reducer NONE

Unfortunately, I keep getting the following error:
-inputreader: class not found: mypackage.NLineRecordReader

My question is two-fold.  Am I using the right approach to handle the
4-line records with the custom RecordReader implementation?  And why isn't
-libjars working to make my class available to hadoop streaming at runtime?

Thanks,
Jason
