hadoop-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: hadoop streaming with custom RecordReader class
Date Thu, 18 Oct 2012 05:58:10 GMT
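(3) is your problem for sure: the class file sits at the root of the jar
instead of under mypackage/, so mypackage.NLineRecordReader cannot be found.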

Try this:

mkdir mypackage
mv <class file> mypackage/
jar cvf NLineRecordReader.jar mypackage
[Use this jar]
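
To confirm the layout before rerunning the job, something like the sketch
below may help. It only checks that the jar listing contains the entry a
package-qualified class name maps to. The helper names class_to_entry and
listing_has_class are made up for illustration and are not part of Hadoop
or the JDK.

```shell
#!/bin/sh
# Sketch: check that a jar holds a class at the path its package requires.
# A class named mypackage.NLineRecordReader must be stored in the jar as
# mypackage/NLineRecordReader.class for the class loader to find it.

# Convert a fully qualified class name to its expected jar entry
# (hypothetical helper).
class_to_entry() {
    printf '%s.class\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Succeed if the jar listing read from stdin contains the entry for
# class $1 (hypothetical helper).
listing_has_class() {
    grep -Fqx "$(class_to_entry "$1")"
}

# Usage (needs the JDK's jar tool on PATH):
#   jar tf NLineRecordReader.jar \
#       | listing_has_class mypackage.NLineRecordReader \
#       && echo "layout ok" || echo "class is not under mypackage/"
```

With the jar you listed in (3), the check would fail, because the only class
entry is NLineRecordReader.class at the root. Compiling with "javac -d ."
also creates the mypackage/ directory for you, which avoids the manual
mkdir and mv.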

On Thu, Oct 18, 2012 at 10:54 AM, Jason Wang <jason.j.wang@gmail.com> wrote:
> 1. I did try using NLineInputFormat, but this causes the
> "stream.map.input.ignoreKey" to no longer work.  As per the streaming
> documentation:
>
> "The configuration parameter is valid only if stream.map.input.writer.class
> is org.apache.hadoop.streaming.io.TextInputWriter.class."
>
> My mapper prefers the streaming stdin to not have the key as part of the
> input.  I could obviously parse that out in the mapper, but the mapper
> belongs to a 3rd party. This is why I tried to do the RecordReader route.
>
> 2. Yes - I did export the classpath before running.
>
> 3. This may be the problem:
>
> bash-3.2$ jar -tf NLineRecordReader.jar
> META-INF/
> META-INF/MANIFEST.MF
> NLineRecordReader.class
>
> I have specified "package mypackage;" at the top of the java file though.
> Then compiled using "javac" and then "jar cf".
>
> 4. The class is public.
>
>
>
> On Wed, Oct 17, 2012 at 11:53 PM, Harsh J <harsh@cloudera.com> wrote:
>>
>> Hi Jason,
>>
>> A few questions (in order):
>>
>> 1. Does Hadoop's own NLineInputFormat not suffice?
>>
>> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
>>
>> 2. Do you make sure to pass your jar into the front-end too?
>>
>> $ export HADOOP_CLASSPATH=/path/to/your/jar
>> $ command…
>>
>> 3. Does jar -tf <yourjar> carry a proper mypackage.NLineRecordReader?
>>
>> 4. Is your class marked public?
>>
>> On Thu, Oct 18, 2012 at 9:32 AM, Jason Wang <jason.j.wang@gmail.com>
>> wrote:
>> > Hi all,
>> > I'm experimenting with hadoop streaming on build 1.0.3.
>> >
>> > To give background info, I'm streaming a text file into a mapper written
>> > in C.
>> > Using the default settings, streaming uses TextInputFormat which creates
>> > one
>> > record from each line.  The problem I am having is that I need record
>> > boundaries to be every 4 lines.  When the splitter breaks up the input
>> > into
>> > the mappers, I have partial records on the boundaries due to this.  To
>> > address this, my approach was to write a new RecordReader class in
>> > Java that is almost identical to LineRecordReader, but with a modified
>> > next() method that reads 4 lines instead of one.
>> >
>> > I then compiled the new class and created a jar.  I wanted to import
>> > this at
>> > run time using the -libjars argument, like so:
>> >
>> > hadoop jar ../contrib/streaming/hadoop-streaming-1.0.3.jar -libjars
>> > NLineRecordReader.jar -files test_stream.sh -inputreader
>> > mypackage.NLineRecordReader -input /Users/hadoop/test/test.txt -output
>> > /Users/hadoop/test/output -mapper "test_stream.sh" -reducer NONE
>> >
>> > Unfortunately, I keep getting the following error:
>> > -inputreader: class not found: mypackage.NLineRecordReader
>> >
>> > My question is twofold.  Am I using the right approach to handle the 4
>> > line
>> > records with the custom RecordReader implementation?  And why isn't
>> > -libjars
>> > working to include my class to hadoop streaming at runtime?
>> >
>> > Thanks,
>> > Jason
>>
>>
>>
>> --
>> Harsh J
>
>



-- 
Harsh J
