From: Jason Wang <jason.j.wang@gmail.com>
To: user@hadoop.apache.org
Date: Thu, 18 Oct 2012 00:24:44 -0500
Subject: Re: hadoop streaming with custom RecordReader class

1. I did try using NLineInputFormat, but this causes "stream.map.input.ignoreKey" to no longer work. As per the streaming documentation: "The configuration parameter is valid only if stream.map.input.writer.class is org.apache.hadoop.streaming.io.TextInputWriter.class."

My mapper prefers the streaming stdin to not have the key as part of the input. I could obviously parse that out in the mapper, but the mapper belongs to a third party. This is why I tried the RecordReader route.

2. Yes, I did export the classpath before running.

3. This may be the problem:

bash-3.2$ jar -tf NLineRecordReader.jar
META-INF/
META-INF/MANIFEST.MF
NLineRecordReader.class

I have specified "package mypackage;" at the top of the java file, though. Then compiled using "javac" and then "jar cf".

4. The class is public.

On Wed, Oct 17, 2012 at 11:53 PM, Harsh J <harsh@cloudera.com> wrote:
> Hi Jason,
>
> A few questions (in order):
>
> 1. Does Hadoop's own NLineInputFormat not suffice?
> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
>
> 2. Do you make sure to pass your jar into the front-end too?
>
> $ export HADOOP_CLASSPATH=/path/to/your/jar
> $ command…
>
> 3. Does jar -tf <yourjar> carry a proper mypackage.NLineRecordReader?
>
> 4. Is your class marked public?
>
> On Thu, Oct 18, 2012 at 9:32 AM, Jason Wang <jason.j.wang@gmail.com> wrote:
> > Hi all,
> > I'm experimenting with hadoop streaming on build 1.0.3.
> >
> > To give background info, I'm streaming a text file into a mapper written in C.
> > Using the default settings, streaming uses TextInputFormat, which creates one
> > record from each line. The problem I am having is that I need record
> > boundaries to be every 4 lines. When the splitter breaks up the input into
> > the mappers, I have partial records on the boundaries due to this. To
> > address this, my approach was to write a new RecordReader class in Java
> > that is almost identical to LineRecordReader, but with a modified
> > next() method that reads 4 lines instead of one.
> >
> > I then compiled the new class and created a jar. I wanted to import this at
> > run time using the -libjars argument, like so:
> >
> > hadoop jar ../contrib/streaming/hadoop-streaming-1.0.3.jar -libjars
> > NLineRecordReader.jar -files test_stream.sh -inputreader
> > mypackage.NLineRecordReader -input /Users/hadoop/test/test.txt -output
> > /Users/hadoop/test/output -mapper "test_stream.sh" -reducer NONE
> >
> > Unfortunately, I keep getting the following error:
> > -inputreader: class not found: mypackage.NLineRecordReader
> >
> > My question is twofold. Am I using the right approach to handle the 4-line
> > records with the custom RecordReader implementation? And why isn't -libjars
> > working to include my class in hadoop streaming at runtime?
> >
> > Thanks,
> > Jason
>
> --
> Harsh J
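[Archive note] The 4-lines-per-record shape the thread is after can be prototyped outside Hadoop with coreutils, which is handy for exercising the C mapper while the custom RecordReader is being debugged. A sketch (the tab-joined record format is an illustrative choice, not what LineRecordReader emits):

```shell
# paste with four "-" arguments consumes stdin four lines at a time,
# joining each group into one tab-delimited record on a single line.
printf 'a\nb\nc\nd\ne\nf\ng\nh\n' | paste -d '\t' - - - -
# prints two records:
# a	b	c	d
# e	f	g	h
```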