From: Hemanth Yamijala <yhemanth@thoughtworks.com>
To: user@hadoop.apache.org
Date: Thu, 6 Dec 2012 20:13:52 +0530
Subject: Re: Map tasks processing some files multiple times

Glad it helps. Could you also explain the reason for using MultipleInputs?
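As context for that question: when every input path uses the same input format and the same mapper, MultipleInputs is not strictly needed. A rough sketch of the two wirings, assuming the new mapreduce API and reusing the class names that appear later in the thread; this is not the actual job code:

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;

    public class InputWiringSketch {

        // Variant A: MultipleInputs, as used in the job below. Mainly useful when
        // different paths need different input formats or different mappers.
        static void withMultipleInputs(Job job, Path lsDir) {
            MultipleInputs.addInputPath(job, lsDir,
                FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);
        }

        // Variant B: plain FileInputFormat wiring, enough when all paths share one
        // input format and one mapper class.
        static void withoutMultipleInputs(Job job, Path lsDir) throws IOException {
            FileInputFormat.addInputPath(job, lsDir);
            job.setInputFormatClass(FileNameTextInputFormat.class);
            job.setMapperClass(LinkShareCatalogImportMapper.class);
        }
    }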
On Thu, Dec 6, 2012 at 2:59 PM, David Parks <davidparks21@yahoo.com> wrote:

Figured it out, it is, as usual, with my code. I had wrapped TextInputFormat to replace the LongWritable key with a key representing the file name. It was a bit tricky to do because of changing the generics from <LongWritable, Text> to <Text, Text>, and I goofed up and misdirected a call to isSplitable(), which was causing the issue.

It now works fine. Thanks very much for the response; it gave me pause to think enough to work out what I had done.

Dave

From: Hemanth Yamijala [mailto:yhemanth@thoughtworks.com]
Sent: Thursday, December 06, 2012 3:25 PM
To: user@hadoop.apache.org
Subject: Re: Map tasks processing some files multiple times

David,

You are using FileNameTextInputFormat. This is not in the Hadoop source, as far as I can see. Can you please confirm where this class comes from? It seems like the isSplitable() method of this input format may need checking.

Another thing: given that you are adding the same input format for all files, do you need MultipleInputs?

Thanks
Hemanth

On Thu, Dec 6, 2012 at 1:06 PM, David Parks <davidparks21@yahoo.com> wrote:

I believe I just tracked down the problem; maybe you can help confirm if you're familiar with this.

I see that FileInputFormat is reporting gzip files (.gz extension) on the s3n filesystem as splittable, and I see that it's creating multiple input splits for these files. I'm mapping the files directly off S3:

    Path lsDir = new Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
    MultipleInputs.addInputPath(job, lsDir, FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);

I see in the map phase, based on my counters, that it's actually processing the entire file (I set up a counter per input file). So the 2 files that were processed twice had 2 splits (I now see that in some debug logs I created), and the 1 file that was processed 3 times had 3 splits (the rest were smaller and were only assigned one split by default anyway).

Am I wrong in expecting all files on the s3n filesystem to come through as not splittable? This seems to be a bug in the Hadoop code if I'm right.

David
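The FileNameTextInputFormat referred to above is David's own wrapper rather than a stock Hadoop class, and its code is not shown in the thread. A minimal sketch of what such a wrapper could look like once the isSplitable() override is in place, so each (gzipped) file becomes exactly one split and the file name becomes the key; the details here are an assumption, not the actual class:

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    // Hypothetical wrapper: emits <file name, line> pairs and never splits a file,
    // so a .gz file on s3n is read start-to-finish by a single map task.
    public class FileNameTextInputFormat extends FileInputFormat<Text, Text> {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            // The crucial override: one split per file, regardless of size or codec.
            return false;
        }

        @Override
        public RecordReader<Text, Text> createRecordReader(InputSplit split,
                                                           TaskAttemptContext context) {
            return new RecordReader<Text, Text>() {
                private final LineRecordReader lines = new LineRecordReader();
                private Text fileName;

                @Override
                public void initialize(InputSplit split, TaskAttemptContext context)
                        throws IOException, InterruptedException {
                    lines.initialize(split, context);
                    fileName = new Text(((FileSplit) split).getPath().getName());
                }

                @Override
                public boolean nextKeyValue() throws IOException, InterruptedException {
                    return lines.nextKeyValue();
                }

                @Override
                public Text getCurrentKey() {
                    return fileName;  // same key (the file name) for every record in the file
                }

                @Override
                public Text getCurrentValue() throws IOException, InterruptedException {
                    return lines.getCurrentValue();
                }

                @Override
                public float getProgress() throws IOException, InterruptedException {
                    return lines.getProgress();
                }

                @Override
                public void close() throws IOException {
                    lines.close();
                }
            };
        }
    }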
From: Raj Vishwanathan [mailto:rajvish@yahoo.com]
Sent: Thursday, December 06, 2012 1:45 PM
To: user@hadoop.apache.org
Subject: Re: Map tasks processing some files multiple times

Could it be due to spec-ex (speculative execution)? Does it make a difference in the end?

Raj

------------------------------
From: David Parks <davidparks21@yahoo.com>
To: user@hadoop.apache.org
Sent: Wednesday, December 5, 2012 10:15 PM
Subject: Map tasks processing some files multiple times

I've got a job that reads in 167 files from S3, but 2 of the files are being mapped twice and 1 of the files is mapped 3 times.

This is the code I use to set up the mapper:

    Path lsDir = new Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
    for (FileStatus f : lsDir.getFileSystem(getConf()).globStatus(lsDir))
        log.info("Identified linkshare catalog: " + f.getPath().toString());
    if (lsDir.getFileSystem(getConf()).globStatus(lsDir).length > 0) {
        MultipleInputs.addInputPath(job, lsDir, FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class);
    }

I can see from the logs that it sees only 1 copy of each of these files and correctly identifies 167 files.

I also have the following confirmation that it found the 167 files correctly:

    2012-12-06 04:56:41,213 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process : 167

When I look through the syslogs I can see that the file in question was opened by two different map attempts:

    ./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000005_0/syslog:2012-12-06 03:56:05,265 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading
    ./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000173_0/syslog:2012-12-06 03:53:18,765 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading

This is only happening to these 3 files; all others seem to be fine. For the life of me I can't see a reason why these files might be processed multiple times.

Notably, map attempt number 173 is higher than should be possible: there are 167 input files (from S3, gzipped), so there should be 167 map attempts. But I see a total of 176 map tasks.

Any thoughts/ideas/guesses?
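For reference on the splittability question: the stock TextInputFormat in reasonably recent Hadoop versions decides per file based on the compression codec, so a plain .gz file should come through as a single split whichever filesystem it lives on; only a wrapper that bypasses this check would report it as splittable. A rough sketch of that logic, written as an override for illustration (the method body approximates what the stock class does, it is not copied from it):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Illustration of codec-based splittability: gzip has no splittable codec,
    // so each .gz input file should yield exactly one split (and one map task).
    public class CodecAwareTextInputFormat extends TextInputFormat {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            CompressionCodec codec =
                new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
            if (codec == null) {
                return true;                                      // uncompressed text: safe to split
            }
            return codec instanceof SplittableCompressionCodec;   // gzip: false, bzip2: true
        }
    }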

