From: Azuryy Yu <azuryyyu@gmail.com>
Date: Sat, 23 Nov 2013 11:30:28 +0800
To: user@hadoop.apache.org
Subject: Re: Missing records from HDFS

There is a problem in 'initialize'. In general, you cannot treat split.start as the real start of your records, because a FileSplit does not necessarily break exactly at the end of a line. So, if start is not equal to 0, you need to adjust it in 'initialize' to the beginning of the next full line.

Likewise, end = start + split.length is not the real end, because it may fall in the middle of a line.

So the reader MUST adjust to the real start and end in 'initialize'; otherwise it may miss some records.
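The adjustment described here is essentially what Hadoop's own LineRecordReader does when a split does not begin at the start of the file. A minimal sketch of that idiom follows; the class name and fields are illustrative only and are not taken from the reader posted below:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

// Illustrative sketch of the usual split-boundary handling, modelled on
// Hadoop's built-in LineRecordReader; not the poster's code.
public class SplitBoundarySketch {

    protected long start, end, pos;
    protected LineReader in;

    public void initializeBoundaries(InputSplit genericSplit,
            TaskAttemptContext context) throws IOException {
        FileSplit split = (FileSplit) genericSplit;
        Configuration job = context.getConfiguration();
        start = split.getStart();
        end = start + split.getLength();

        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(job);
        FSDataInputStream fileIn = fs.open(file);
        fileIn.seek(start);
        in = new LineReader(fileIn, job);

        // If this split does not begin at the start of the file, the first
        // (possibly partial) line belongs to the previous split: skip it and
        // move 'start' forward to the first full line inside this split.
        if (start != 0) {
            start += in.readLine(new Text(), 0,
                    (int) Math.min(Integer.MAX_VALUE, end - start));
        }
        pos = start;
    }
}

The matching convention on the other side of the boundary is that a reader keeps consuming the record that starts before 'end' even if it finishes past 'end'; the next split then skips that partial leading line, so every record is read exactly once.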
Sure, our FileInputFormat implementation:

public class CVSInputFormat extends
        FileInputFormat<FileValidatorDescriptor, Text> {

    /*
     * (non-Javadoc)
     *
     * @see
     * org.apache.hadoop.mapreduce.InputFormat#createRecordReader(org.apache
     * .hadoop.mapreduce.InputSplit,
     * org.apache.hadoop.mapreduce.TaskAttemptContext)
     */
    @Override
    public RecordReader<FileValidatorDescriptor, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        String delimiter = context.getConfiguration().get(
                "textinputformat.record.delimiter");
        byte[] recordDelimiterBytes = null;
        if (null != delimiter)
            recordDelimiterBytes = delimiter.getBytes();
        return new CVSLineRecordReader(recordDelimiterBytes);
    }

    /*
     * (non-Javadoc)
     *
     * @see
     * org.apache.hadoop.mapreduce.lib.input.FileInputFormat#isSplitable(org
     * .apache.hadoop.mapreduce.JobContext, org.apache.hadoop.fs.Path)
     */
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        CompressionCodec codec = new CompressionCodecFactory(
                context.getConfiguration()).getCodec(file);
        return codec == null;
    }
}
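For context, an input format like this is normally attached to a job along the following lines. This is only a minimal sketch: the driver class, the commented-out mapper and the paths are assumptions made here for illustration (the job in this thread is actually launched through Oozie, with the input path supplied via mapred.input.dir); CVSInputFormat and the textinputformat.record.delimiter key come from the code above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal driver sketch showing how a custom FileInputFormat such as
// CVSInputFormat is typically wired into a job.
public class CvsJobDriverSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Optional: a non-default record delimiter, read back in createRecordReader().
        // conf.set("textinputformat.record.delimiter", "\n");

        Job job = Job.getInstance(conf, "csv-validation-sketch");
        job.setJarByClass(CvsJobDriverSketch.class);
        job.setInputFormatClass(CVSInputFormat.class);
        // job.setMapperClass(MyCsvMapper.class); // a Mapper<FileValidatorDescriptor, Text, ?, ?>

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}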
the recordReader:

public class CVSLineRecordReader extends
        RecordReader<FileValidatorDescriptor, Text> {

    private static final Log LOG = LogFactory.getLog(CVSLineRecordReader.class);

    public static final String CVS_FIRST_LINE = "file.first.line";

    private long start;
    private long pos;
    private long end;
    private LineReader in;
    private int maxLineLength;
    private FileValidatorDescriptor key = null;
    private Text value = null;
    private Text data = null;
    private byte[] recordDelimiterBytes;

    public CVSLineRecordReader(byte[] recordDelimiter) {
        this.recordDelimiterBytes = recordDelimiter;
    }

    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext context)
            throws IOException {
        Properties properties = new Properties();
        Configuration configuration = context.getConfiguration();

        Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context
                .getConfiguration());
        for (Path cacheFile : cacheFiles) {
            if (cacheFile.toString().endsWith(
                    context.getConfiguration().get(VALIDATOR_CONF_PATH))) {
                properties.load(new FileReader(cacheFile.toString()));
            }
        }

        FileSplit split = (FileSplit) genericSplit;
        Configuration job = context.getConfiguration();
        this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength",
                Integer.MAX_VALUE);
        start = split.getStart();
        end = start + split.getLength();
        pos = start;
        final Path file = split.getPath();

        // open the file and seek to the start of the split
        FileSystem fs = file.getFileSystem(job);
        FSDataInputStream fileIn = fs.open(split.getPath());

        this.in = generateReader(fileIn, job);

        // if CVS_FIRST_LINE does not exist in conf then the csv file first line
        // is the header
        if (properties.containsKey(CVS_FIRST_LINE)) {
            configuration.set(CVS_FIRST_LINE, properties.get(CVS_FIRST_LINE)
                    .toString());
        } else {
            readData();
            configuration.set(CVS_FIRST_LINE, data.toString());
            if (start != 0) {
                fileIn.seek(start);
                in = generateReader(fileIn, job);
                pos = start;
            }
        }

        key = new FileValidatorDescriptor();
        key.setFileName(split.getPath().getName());
        context.getConfiguration().set("file.name", key.getFileName());
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        int newSize = readData();
        if (newSize == 0) {
            key = null;
            value = null;
            return false;
        } else {
            key.setOffset(this.pos);
            value = data;
            return true;
        }
    }

    private LineReader generateReader(FSDataInputStream fileIn,
            Configuration job) throws IOException {
        if (null == this.recordDelimiterBytes) {
            return new LineReader(fileIn, job);
        } else {
            return new LineReader(fileIn, job, this.recordDelimiterBytes);
        }
    }

    private int readData() throws IOException {
        if (data == null) {
            data = new Text();
        }
        int newSize = 0;
        while (pos < end) {
            newSize = in.readLine(data, maxLineLength,
                    Math.max((int) Math.min(Integer.MAX_VALUE, end - pos),
                            maxLineLength));
            if (newSize == 0) {
                break;
            }
            pos += newSize;
            if (newSize < maxLineLength) {
                break;
            }

            // line too long. try again
            LOG.info("Skipped line of size " + newSize + " at pos "
                    + (pos - newSize));
        }
        return newSize;
    }

    @Override
    public FileValidatorDescriptor getCurrentKey() {
        return key;
    }

    @Override
    public Text getCurrentValue() {
        return value;
    }

    @Override
    public float getProgress() {
        if (start == end) {
            return 0.0f;
        } else {
            return Math.min(1.0f, (pos - start) / (float) (end - start));
        }
    }

    @Override
    public synchronized void close() throws IOException {
        if (in != null) {
            in.close();
        }
    }
}

Thanks.

From: Azuryy Yu <azuryyyu@gmail.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Friday, 22 November 2013 12:19
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: Missing records from HDFS

I do think this is because of your RecordReader. Can you paste your code here, and give a small sample of the data? Please use pastebin if you want.

On Fri, Nov 22, 2013 at 7:16 PM, ZORAIDA HIDALGO SANCHEZ <zoraida@tid.es> wrote:

> One more thing,
>
> if we split the files, then all the records are processed. The files are
> 70.5 MB each.
>
> Thanks,
>
> Zoraida.-
>
> From: zoraida <zoraida@tid.es>
> Date: Friday, 22 November 2013 08:59
>
> To: "user@hadoop.apache.org" <user@hadoop.apache.org>
> Subject: Re: Missing records from HDFS
>
> Thanks for your response Azuryy.
>
> My Hadoop version: 2.0.0-cdh4.3.0
> InputFormat: a custom class that extends FileInputFormat (a CSV input
> format).
> These files are under the same directory, as separate files.
> My input path is configured through Oozie via the property
> mapred.input.dir.
>
> The same code and input running on Hadoop 2.0.0-cdh4.2.1 works fine; it
> does not discard any record.
>
> Thanks.
>
> From: Azuryy Yu <azuryyyu@gmail.com>
> Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
> Date: Thursday, 21 November 2013 07:31
> To: "user@hadoop.apache.org" <user@hadoop.apache.org>
> Subject: Re: Missing records from HDFS
>
> What's your Hadoop version, and which InputFormat are you using?
>
> Are these files under one directory, or are there lots of subdirectories?
> How did you configure the input path in your main()?
>
>
> On Thu, Nov 21, 2013 at 12:25 AM, ZORAIDA HIDALGO SANCHEZ <zoraida@tid.es> wrote:
>
>> Hi all,
>>
>> my job is not reading all the input records. In the input directory I
>> have a set of files containing a total of 6000000 records, but only 5997000
>> are processed; the Map Input Records counter says 5997000.
>> I have tried downloading the files with a getmerge to check how many
>> records they contain, and the correct number (6000000) is returned.
>>
>> Do you have any suggestion?
>>
>> Thanks.
