Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (nike.apache.org: domain of airbots@gmail.com designates
 209.85.212.48 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAOcnVr0V9R86ASYZyjThygt5U9nfzRF=aSKS8G-am8t9Pq6y-A@mail.gmail.com>
References: 
 <CAGparvVQoXfaGW8Vy6E3+hWzHmFgpNUZG45LK9+rOLMs6DAs_g@mail.gmail.com>
	<CAOcnVr0V9R86ASYZyjThygt5U9nfzRF=aSKS8G-am8t9Pq6y-A@mail.gmail.com>
Date: Wed, 29 Aug 2012 02:57:18 -0500
Message-ID: 
 <CAGparvX2F51Jgfp+mO-yi5nEz3Gqzb1zo-CMqcEZ8iDKs_FOSQ@mail.gmail.com>
Subject: Re: Custom InputFormat error
From: Chen He <airbots@gmail.com>
To: user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=bcaec5015f2b71361c04c862e812

--bcaec5015f2b71361c04c862e812
Content-Type: text/plain; charset=ISO-8859-1

Hi Harsh

Thank you for your reply. Do you mean I need to change the FileSplit to
avoid the errors I mentioned?

Regards!

Chen

On Wed, Aug 29, 2012 at 2:46 AM, Harsh J <harsh@cloudera.com> wrote:

> Hi Chen,
>
> Do your record reader and mapper handle the case where a map split
> does not contain a whole record? Your case is not very different
> from the newlines logic presented here:
> http://wiki.apache.org/hadoop/HadoopMapReduce
>
> On Wed, Aug 29, 2012 at 11:13 AM, Chen He <airbots@gmail.com> wrote:
> > Hi guys
> >
> > I ran into an interesting problem while implementing my own custom
> > InputFormat, which extends FileInputFormat. (I rewrote the RecordReader
> > class but not the InputSplit class.)
> >
> > My record reader, which extends LineRecordReader, returns a record when
> > it reaches #Trailer# and the record contains #Header#. I have only one
> > input file, composed of many basic records in the following format:
> >
> > #Header#
> > .....(many lines, may be 0 lines or 1000 lines, it varies)
> > #Trailer#
> >
> > Everything works fine if the number of basic records in the file is an
> > integer multiple of the number of mappers. For example, I use 2 mappers
> > and there are 2 basic records in my input file, or I use 3 mappers and
> > there are 6 basic records in the input file.
> >
> > However, if I use 4 mappers and there are 3 basic records in the input
> > file (not an integer multiple), the final output is incorrect. The "Map
> > Input Bytes" value in the job counters is also less than the input file
> > size. How can I fix it? Do I need to rewrite the InputSplit?
> >
> > Any reply will be appreciated!
> >
> > Regards!
> >
> > Chen
>
>
>
> --
> Harsh J
>
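
For reference, here is a minimal sketch of the split-boundary handling
discussed above, in plain Java with no Hadoop dependency. The class name,
the method names, and the ownership rule are assumptions made for
illustration; this is not the LineRecordReader implementation. The rule
assumed here: a record belongs to the split in which its #Header# line
starts, and the reader of that split keeps reading past the end of the
split until it reaches the matching #Trailer#.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a custom RecordReader; operates on a byte array
// instead of an HDFS stream so that it can run without Hadoop.
class SplitAwareRecordSketch {
    static final String HEADER = "#Header#";
    static final String TRAILER = "#Trailer#";

    // Index of the first byte after the line that contains position 'from'.
    static int nextLineStart(byte[] data, int from) {
        int i = from;
        while (i < data.length && data[i] != '\n') i++;
        return Math.min(i + 1, data.length);
    }

    // Returns the records owned by the byte range [start, end).
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        // Starting mid-line: that line began in the previous split, so skip it.
        if (start > 0 && data[start - 1] != '\n') {
            pos = nextLineStart(data, start);
        }
        StringBuilder current = null; // non-null while inside a record
        while (pos < data.length) {
            int lineStart = pos;
            pos = nextLineStart(data, lineStart);
            String raw = new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8);
            String marker = raw.trim();
            if (current == null) {
                // Outside a record: stop once line starts leave our range.
                if (lineStart >= end) break;
                if (marker.equals(HEADER)) current = new StringBuilder(raw);
                // Any other line is the tail of a record owned by an earlier split.
            } else {
                // Inside a record: keep reading, even beyond 'end'.
                current.append(raw);
                if (marker.equals(TRAILER)) {
                    records.add(current.toString());
                    current = null;
                }
            }
        }
        // A record with no #Trailer# before end of file is dropped in this sketch.
        return records;
    }

    public static void main(String[] args) {
        String file = "#Header#\na\nb\n#Trailer#\n"
                + "#Header#\n#Trailer#\n"
                + "#Header#\nc\n#Trailer#\n";
        byte[] data = file.getBytes(StandardCharsets.UTF_8);
        int splits = 4; // 3 records, 4 splits: the failing case from the thread
        int total = 0;
        for (int i = 0; i < splits; i++) {
            int start = data.length * i / splits;
            int end = data.length * (i + 1) / splits;
            List<String> recs = readSplit(data, start, end);
            total += recs.size();
            System.out.println("split " + i + " [" + start + "," + end + "): " + recs.size() + " record(s)");
        }
        System.out.println("total records: " + total);
    }
}
```

With a rule like this, every record is read by exactly one split reader no
matter where the byte boundaries fall, so the InputSplit class itself can
stay unchanged. In a real job the same logic would live in the custom
RecordReader.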

--bcaec5015f2b71361c04c862e812--