hadoop-user mailing list archives

From Jason Wang <jason.j.w...@gmail.com>
Subject Re: hadoop streaming with custom RecordReader class
Date Thu, 18 Oct 2012 20:03:10 GMT
Thanks a bunch Harsh, that was my problem.  It was strange because even with
no package specified, it was not able to find the class.  So it's working
now, though it seems that hadoop streaming ignores the specified
-inputreader class completely, but that's a different issue.
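
The grouping behavior behind the custom reader discussed in the thread below can be sketched independently of Hadoop. This is only an illustration: the class and method names here are assumptions, and the actual streaming reader must implement the old-API org.apache.hadoop.mapred.RecordReader interface (splits, keys, progress), which is omitted.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Sketch of the core logic: emit one record per 4 input lines instead of
// one per line, the way the modified next() in the thread is described.
// Hypothetical class name; not the actual class from the thread's jar.
public class FourLineGrouper {
    private final BufferedReader in;

    public FourLineGrouper(BufferedReader in) {
        this.in = in;
    }

    // Returns the next record (up to 4 lines joined with '\n'), or null
    // at end of input.  A trailing partial group is still returned,
    // mirroring how LineRecordReader handles a final short line.
    public String next() throws IOException {
        StringBuilder record = new StringBuilder();
        int lines = 0;
        String line;
        while (lines < 4 && (line = in.readLine()) != null) {
            if (lines > 0) record.append('\n');
            record.append(line);
            lines++;
        }
        return lines == 0 ? null : record.toString();
    }

    public static void main(String[] args) throws IOException {
        BufferedReader r = new BufferedReader(
                new StringReader("a\nb\nc\nd\ne\nf\n"));
        FourLineGrouper g = new FourLineGrouper(r);
        String rec;
        while ((rec = g.next()) != null) {
            System.out.println(rec.replace("\n", "|"));
        }
        // prints "a|b|c|d" then "e|f"
    }
}
```

Note that this does nothing about split boundaries: in a real job the reader would also need to ensure splits start on a 4-line boundary (or read past the split end to finish a group), which is the harder part of the problem described below.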

On Thu, Oct 18, 2012 at 12:58 AM, Harsh J <harsh@cloudera.com> wrote:

> Also, consider using Maven for this kind of development; it helps build
> sane jars automatically :)
>
> On Thu, Oct 18, 2012 at 11:28 AM, Harsh J <harsh@cloudera.com> wrote:
> > (3)'s your problem for sure.
> >
> > Try this:
> >
> > mkdir mypackage
> > mv <class file> mypackage/
> > jar cvf NLineRecordReader.jar mypackage
> > [Use this jar]
> >
> >> On Thu, Oct 18, 2012 at 10:54 AM, Jason Wang <jason.j.wang@gmail.com>
> >> wrote:
> >> 1. I did try using NLineInputFormat, but this causes the
> >> "stream.map.input.ignoreKey" to no longer work.  As per the streaming
> >> documentation:
> >>
> >> "The configuration parameter is valid only if
> stream.map.input.writer.class
> >> is org.apache.hadoop.streaming.io.TextInputWriter.class."
> >>
> >> My mapper prefers the streaming stdin to not have the key as part of the
> >> input.  I could obviously parse that out in the mapper, but the mapper
> >> belongs to a 3rd party.  This is why I tried the RecordReader route.
> >>
> >> 2. Yes - I did export the classpath before running.
> >>
> >> 3. This may be the problem:
> >>
> >> bash-3.2$ jar -tf NLineRecordReader.jar
> >> META-INF/
> >> META-INF/MANIFEST.MF
> >> NLineRecordReader.class
> >>
> >> I have specified "package mypackage;" at the top of the java file though.
> >> Then compiled using "javac" and then "jar cf".
> >>
> >> 4. The class is public.
> >>
> >>
> >>
> >> On Wed, Oct 17, 2012 at 11:53 PM, Harsh J <harsh@cloudera.com> wrote:
> >>>
> >>> Hi Jason,
> >>>
> >>> A few questions (in order):
> >>>
> >>> 1. Does Hadoop's own NLineInputFormat not suffice?
> >>>
> >>>
> >>> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
> >>>
> >>> 2. Do you make sure to pass your jar into the front-end too?
> >>>
> >>> $ export HADOOP_CLASSPATH=/path/to/your/jar
> >>> $ command…
> >>>
> >>> 3. Does jar -tf <yourjar> carry a proper mypackage.NLineRecordReader?
> >>>
> >>> 4. Is your class marked public?
> >>>
> >>> On Thu, Oct 18, 2012 at 9:32 AM, Jason Wang <jason.j.wang@gmail.com>
> >>> wrote:
> >>> > Hi all,
> >>> > I'm experimenting with hadoop streaming on build 1.0.3.
> >>> >
> >>> > To give background info, I'm streaming a text file into a mapper
> >>> > written in C.
> >>> > Using the default settings, streaming uses TextInputFormat, which
> >>> > creates one record from each line.  The problem I am having is that I
> >>> > need record boundaries to fall every 4 lines.  When the splitter
> >>> > breaks the input up into the mappers, I get partial records at the
> >>> > split boundaries because of this.  To address this, my approach was
> >>> > to write a new RecordReader class in java that is almost identical to
> >>> > LineRecordReader, but with a modified next() method that reads 4
> >>> > lines instead of one.
> >>> >
> >>> > I then compiled the new class and created a jar.  I wanted to include
> >>> > it at run time using the -libjars argument, like so:
> >>> >
> >>> > hadoop jar ../contrib/streaming/hadoop-streaming-1.0.3.jar -libjars
> >>> > NLineRecordReader.jar -files test_stream.sh -inputreader
> >>> > mypackage.NLineRecordReader -input /Users/hadoop/test/test.txt
> >>> > -output /Users/hadoop/test/output -mapper "test_stream.sh" -reducer
> >>> > NONE
> >>> >
> >>> > Unfortunately, I keep getting the following error:
> >>> > -inputreader: class not found: mypackage.NLineRecordReader
> >>> >
> >>> > My question is two-fold.  Am I using the right approach to handle
> >>> > the 4-line records with the custom RecordReader implementation?  And
> >>> > why isn't -libjars working to include my class in hadoop streaming
> >>> > at runtime?
> >>> >
> >>> > Thanks,
> >>> > Jason
> >>>
> >>>
> >>>
> >>> --
> >>> Harsh J
> >>
> >>
> >
> >
> >
> > --
> > Harsh J
>
>
>
> --
> Harsh J
>
