From: ZORAIDA HIDALGO SANCHEZ <zoraida@tid.es>
Date: Fri, 22 Nov 2013 11:33:40 +0000
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: Missing records from HDFS
Sure,

our FileInputFormat implementation:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// FileValidatorDescriptor is our own key type, defined elsewhere in this package.
public class CVSInputFormat extends
        FileInputFormat<FileValidatorDescriptor, Text> {

    /*
     * (non-Javadoc)
     *
     * @see
     * org.apache.hadoop.mapreduce.InputFormat#createRecordReader(org.apache
     * .hadoop.mapreduce.InputSplit,
     * org.apache.hadoop.mapreduce.TaskAttemptContext)
     */
    @Override
    public RecordReader<FileValidatorDescriptor, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // Use a custom record delimiter when one is configured.
        String delimiter = context.getConfiguration().get(
                "textinputformat.record.delimiter");
        byte[] recordDelimiterBytes = null;
        if (null != delimiter)
            recordDelimiterBytes = delimiter.getBytes();
        return new CVSLineRecordReader(recordDelimiterBytes);
    }

    /*
     * (non-Javadoc)
     *
     * @see
     * org.apache.hadoop.mapreduce.lib.input.FileInputFormat#isSplitable(org
     * .apache.hadoop.mapreduce.JobContext, org.apache.hadoop.fs.Path)
     */
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Only uncompressed files are splittable.
        CompressionCodec codec = new CompressionCodecFactory(
                context.getConfiguration()).getCodec(file);
        return codec == null;
    }
}
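
For context, a minimal driver sketch (hypothetical, not from the thread) showing how an input format like this is typically wired into a job. The job name and paths are placeholders; in our jobs the input path actually comes from Oozie via mapred.input.dir, which is the property the addInputPath call below ends up setting.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver, for illustration only: wires CVSInputFormat into a
// job in the standard way. Job name and paths are placeholders.
public class CVSJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // CVSInputFormat passes this delimiter to CVSLineRecordReader; when
        // it is unset, the reader falls back to LineReader's default
        // newline handling.
        conf.set("textinputformat.record.delimiter", "\n");
        Job job = Job.getInstance(conf, "csv-validation");
        job.setJarByClass(CVSJobDriver.class);
        job.setInputFormatClass(CVSInputFormat.class);
        // Mapper/reducer setup omitted for brevity.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}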


the recordReader:

import java.io.FileReader;
import java.io.IOException;
import java.util.Properties;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

// VALIDATOR_CONF_PATH is a constant defined elsewhere in our code
// (assumed static-imported here).
public class CVSLineRecordReader extends
        RecordReader<FileValidatorDescriptor, Text> {

    private static final Log LOG = LogFactory.getLog(CVSLineRecordReader.class);

    public static final String CVS_FIRST_LINE = "file.first.line";

    private long start;
    private long pos;
    private long end;
    private LineReader in;
    private int maxLineLength;
    private FileValidatorDescriptor key = null;
    private Text value = null;
    private Text data = null;
    private byte[] recordDelimiterBytes;

    public CVSLineRecordReader(byte[] recordDelimiter) {
        this.recordDelimiterBytes = recordDelimiter;
    }

    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext context)
            throws IOException {
        Properties properties = new Properties();
        Configuration configuration = context.getConfiguration();

        // Load the validator configuration from the distributed cache.
        Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context
                .getConfiguration());
        for (Path cacheFile : cacheFiles) {
            if (cacheFile.toString().endsWith(
                    context.getConfiguration().get(VALIDATOR_CONF_PATH))) {
                properties.load(new FileReader(cacheFile.toString()));
            }
        }

        FileSplit split = (FileSplit) genericSplit;
        Configuration job = context.getConfiguration();
        this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength",
                Integer.MAX_VALUE);
        start = split.getStart();
        end = start + split.getLength();
        pos = start;
        final Path file = split.getPath();

        // open the file and seek to the start of the split
        FileSystem fs = file.getFileSystem(job);
        FSDataInputStream fileIn = fs.open(split.getPath());

        this.in = generateReader(fileIn, job);

        // if CVS_FIRST_LINE does not exist in conf then the csv file first
        // line is the header
        if (properties.containsKey(CVS_FIRST_LINE)) {
            configuration.set(CVS_FIRST_LINE, properties.get(CVS_FIRST_LINE)
                    .toString());
        } else {
            readData();
            configuration.set(CVS_FIRST_LINE, data.toString());
            if (start != 0) {
                fileIn.seek(start);
                in = generateReader(fileIn, job);
                pos = start;
            }
        }

        key = new FileValidatorDescriptor();
        key.setFileName(split.getPath().getName());
        context.getConfiguration().set("file.name", key.getFileName());
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        int newSize = readData();
        if (newSize == 0) {
            key = null;
            value = null;
            return false;
        } else {
            key.setOffset(this.pos);
            value = data;
            return true;
        }
    }

    private LineReader generateReader(FSDataInputStream fileIn,
            Configuration job) throws IOException {
        if (null == this.recordDelimiterBytes) {
            return new LineReader(fileIn, job);
        } else {
            return new LineReader(fileIn, job, this.recordDelimiterBytes);
        }
    }

    private int readData() throws IOException {
        if (data == null) {
            data = new Text();
        }
        int newSize = 0;
        while (pos < end) {
            newSize = in.readLine(data, maxLineLength,
                    Math.max((int) Math.min(Integer.MAX_VALUE, end - pos),
                            maxLineLength));
            if (newSize == 0) {
                break;
            }
            pos += newSize;
            if (newSize < maxLineLength) {
                break;
            }

            // line too long. try again
            LOG.info("Skipped line of size " + newSize + " at pos "
                    + (pos - newSize));
        }
        return newSize;
    }

    @Override
    public FileValidatorDescriptor getCurrentKey() {
        return key;
    }

    @Override
    public Text getCurrentValue() {
        return value;
    }

    @Override
    public float getProgress() {
        if (start == end) {
            return 0.0f;
        } else {
            return Math.min(1.0f, (pos - start) / (float) (end - start));
        }
    }

    @Override
    public synchronized void close() throws IOException {
        if (in != null) {
            in.close();
        }
    }
}
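
Since the records only go missing when a file spans more than one split, it is worth comparing the split handling above with what Hadoop's stock LineRecordReader does: there, every split except the first discards its first (possibly partial) line, and every split deliberately reads one line past its own end, so each record is consumed by exactly one split. The sketch below is a simplified paraphrase of that protocol (an illustration, not the literal Hadoop source; the field names just mirror the code above). CVSLineRecordReader differs in both respects — it loops on pos < end and never skips the partial first line of a later split — which is worth ruling out as the source of the discrepancy.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

// Simplified paraphrase of the split-boundary protocol in Hadoop's stock
// LineRecordReader (illustration only, not the literal source).
class SplitBoundarySketch {
    private long start, pos, end;
    private int maxLineLength = Integer.MAX_VALUE;
    private LineReader in;
    private Text value = new Text();

    void initialize(FileSplit split, Configuration job) throws IOException {
        start = split.getStart();
        end = start + split.getLength();
        FileSystem fs = split.getPath().getFileSystem(job);
        FSDataInputStream fileIn = fs.open(split.getPath());
        fileIn.seek(start);
        in = new LineReader(fileIn, job);
        if (start != 0) {
            // A later split usually lands mid-line: skip to the end of that
            // line, because the previous split owns it.
            start += in.readLine(new Text(), 0,
                    (int) Math.min(Integer.MAX_VALUE, end - start));
        }
        pos = start;
    }

    boolean nextKeyValue() throws IOException {
        int newSize = 0;
        // "<=" rather than "<": read one line past `end`, because the line
        // straddling the boundary belongs to this split.
        while (pos <= end) {
            newSize = in.readLine(value, maxLineLength, maxLineLength);
            if (newSize == 0) {
                break;              // end of stream
            }
            pos += newSize;
            if (newSize < maxLineLength) {
                break;              // complete record read
            }
            // otherwise the line was too long: keep consuming
        }
        return newSize != 0;
    }
}

That said, since the same code and input work on 2.0.0-cdh4.2.1 (see below), a behavioral change between the two releases — for example in LineReader's handling of custom record delimiters — is also worth investigating before blaming the reader alone.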


Thanks.


From: Azuryy Yu <azuryyyu@gmail.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Friday, 22 November 2013 12:19
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: Missing records from HDFS

I do think this is because of your RecordReader; can you paste your code here and give a small example of the data?

Please use pastebin if you want.


On Fri, Nov 22, 2013 at 7:16 PM, ZORAIDA HIDALGO SANCHEZ <zoraida@tid.es> wrote:
One more thing,

if we split the files then all the records are processed. The files are 70.5 MB.

Thanks,

Zoraida.-
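
A quick way to check whether 70.5 MB files are in fact being cut into multiple splits is to ask the input format directly. The sketch below is a hypothetical diagnostic, not code from the thread; the input path is a placeholder argument.

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Hypothetical diagnostic: prints the splits a CVSInputFormat job would
// receive for a given input directory. If each 70.5 MB file yields more
// than one split, boundary handling in the record reader is worth a close
// look; if each file comes back as a single split, boundaries can be
// ruled out.
public class SplitCheck {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-check");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        List<InputSplit> splits = new CVSInputFormat().getSplits(job);
        for (InputSplit split : splits) {
            System.out.println(split);  // a FileSplit prints path:start+length
        }
        System.out.println("total splits: " + splits.size());
    }
}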

From: zoraida <zoraida@tid.es>
Date: Friday, 22 November 2013 08:59
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: Missing records from HDFS

Thanks for your response Azuryy.

My Hadoop version: 2.0.0-cdh4.3.0
InputFormat: a custom class that extends FileInputFormat (a CSV input format)
These files are under the same directory, as separate files.
My input path is configured through Oozie via the property mapred.input.dir.

The same code and input running on Hadoop 2.0.0-cdh4.2.1 works fine; it does not discard any record.

Thanks.

From: Azuryy Yu <azuryyyu@gmail.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Thursday, 21 November 2013 07:31
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: Missing records from HDFS

What's your Hadoop version, and which InputFormat are you using?

Are these files under one directory, or are there lots of subdirectories? How did you configure the input path in your main?



On Thu, Nov 21, 2013 at 12:25 AM, ZORAIDA HIDALGO SANCHEZ <zoraida@tid.es> wrote:
Hi all,

my job is not reading all the input records. In the input directory I have a set of files containing a total of 6,000,000 records, but only 5,997,000 are processed: the Map Input Records counter says 5,997,000.
I have tried downloading the files with a getmerge to check how many records that would return, and it returns the correct number (6,000,000).

Do you have any suggestion?

Thanks.
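
For the same kind of cross-check without getmerge, here is a small standalone sketch (hypothetical, not from the thread) that counts records directly through the HDFS client, bypassing input splits entirely, so its total can be compared with the Map Input Records counter. The input directory is a placeholder argument.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.LineReader;

// Hypothetical cross-check utility: counts newline-delimited records in
// every file directly under a directory, with no splitting involved.
public class RecordCount {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        long total = 0;
        for (FileStatus status : fs.listStatus(new Path(args[0]))) {
            if (status.isDirectory()) continue;  // skip subdirectories
            FSDataInputStream in = fs.open(status.getPath());
            LineReader reader = new LineReader(in, conf);
            Text line = new Text();
            while (reader.readLine(line) > 0) {
                total++;
            }
            reader.close();
        }
        System.out.println("records: " + total);
    }
}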



________________________________
This message is intended exclusively for its addressee. We only send and receive email on the basis of the terms set out at:
http://www.tid.es/ES/PAGINAS/disclaimer.aspx