Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (nike.apache.org: domain of x6i4uyzbz.labs@gmail.com
 designates 209.85.210.181 as permitted sender)
MIME-Version: 1.0
Sender: gpolaert@gmail.com
In-Reply-To: 
 <CAOcnVr2v6O1GAhq2DchLbkjBBAcyvU-0zFp0gwb9eSfZt=n+Qw@mail.gmail.com>
References: 
 <CAPPeW7UaTC0JXz+8thrhE9SbV-HCC9qVvrXY3jNPKEUAyLwyww@mail.gmail.com>
	<CAOcnVr0-dvOiG1kna1dH_zoK2dd+5EXgyeTH_8=ya9mDMeXjQQ@mail.gmail.com>
	<CAPPeW7UOVfyzg92zZm8duT=WyX1soNDh=WpskcnVpKwHc1GyRg@mail.gmail.com>
	<CAPQV63X9tRMc+BZ5Rbkn0MSAu_e=qL7x-_nzyRPiHBec2sMrNQ@mail.gmail.com>
	<CAPPeW7UEOVewGVb9AYq_ibVe+D9ZOy6mD0F_Fsn=u3fpYFhe2A@mail.gmail.com>
	<CAOcnVr2r9JbDu7t_08FAMF8JoZ+z0BZdjzTw47NevX148HGKcg@mail.gmail.com>
	<CAPPeW7Vr_HiqAHv1oq69ooD6k7JrcMtZh83fEsup0YxjGrcyCw@mail.gmail.com>
	<CAOcnVr2v6O1GAhq2DchLbkjBBAcyvU-0zFp0gwb9eSfZt=n+Qw@mail.gmail.com>
Date: Thu, 6 Dec 2012 17:53:05 +0100
Message-ID: 
 <CAPPeW7WFHcrxEVjkspOATp3Cabo5s-sLTOgtTgnXfwreu7b1rA@mail.gmail.com>
Subject: Re: M/R, Strange behavior with multiple Gzip files
From: x6i4uybz labs <x6i4uyzbz.labs@gmail.com>
To: user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=14dae9340ef1d27bdf04d031ee44

--14dae9340ef1d27bdf04d031ee44
Content-Type: text/plain; charset=ISO-8859-1

If 0% -> 100% jumps are common, then my job is running normally.
That's fine with me. Thanks for your answers.
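
For anyone reading this thread in the archive, here is a minimal sketch of why
progress over gzip input tends to move in coarse steps. It is not Hadoop code:
it uses only the JDK, and the names GzipProgressSketch, CountingInputStream and
progressSamples are made up for illustration. It assumes progress is estimated
as "compressed bytes consumed / compressed file length", which is only one
possible way for a record reader to report progress.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration only; not part of Hadoop.
class GzipProgressSketch {

    /** Counts how many compressed bytes have been pulled from the wrapped stream. */
    static final class CountingInputStream extends FilterInputStream {
        long consumed = 0;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b >= 0) consumed++;
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) consumed += n;
            return n;
        }
    }

    /** Gzips a byte array in memory. */
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        }
        return bos.toByteArray();
    }

    /**
     * Reads the gzip data in chunks of uncompressed bytes and records, after
     * each chunk, the fraction of compressed bytes consumed so far.
     */
    static List<Float> progressSamples(byte[] compressed, int chunkSize) throws IOException {
        List<Float> samples = new ArrayList<>();
        CountingInputStream counter =
                new CountingInputStream(new ByteArrayInputStream(compressed));
        try (GZIPInputStream in = new GZIPInputStream(counter)) {
            byte[] chunk = new byte[chunkSize];
            while (in.read(chunk) != -1) {
                samples.add((float) counter.consumed / compressed.length);
            }
            // One more sample once the end of the stream has been reached.
            samples.add((float) counter.consumed / compressed.length);
        }
        return samples;
    }

    public static void main(String[] args) throws IOException {
        // Build some text-like input with a fixed seed so runs are repeatable.
        StringBuilder sb = new StringBuilder();
        Random rnd = new Random(1);
        for (int i = 0; i < 2000; i++) {
            sb.append("record-").append(rnd.nextLong()).append('\n');
        }
        byte[] compressed = gzip(sb.toString().getBytes("UTF-8"));
        List<Float> samples = progressSamples(compressed, 4096);
        System.out.println("compressed bytes: " + compressed.length);
        System.out.println("first sample: " + samples.get(0));
        System.out.println("last sample:  " + samples.get(samples.size() - 1));
    }
}
```

The decompressor pulls compressed input in buffer-sized chunks, so the ratio
moves in steps rather than smoothly, and for a small file it can reach 100%
after the very first read. A reader that cannot see the compressed position
at all has even less to go on, which would match a bar that sits at 0% and
then jumps to 100%.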


On Thu, Dec 6, 2012 at 5:39 PM, Harsh J <harsh@cloudera.com> wrote:

> Ok, I can't tell about the performance of your map process, but it is
> fairly common to see 0% -> 100% jumps in progress bars when working
> over compressed data, as the progress (in terms of data records
> processed overall) can't be perfectly determined. It might even be a
> bug that was recently fixed.
>
> If your counters are updating fast enough over the minute, then I'd
> assume all is well. The local job runner concerns come from your
> statements that only one map seems to be running at a time,
> but perhaps that's not the case anymore?
>
> On Thu, Dec 6, 2012 at 9:55 PM, x6i4uybz labs <x6i4uyzbz.labs@gmail.com>
> wrote:
> > Thanks for your answers.
> >
> > I don't have the whole solution yet, but I know that:
> >   - the job is not running on a local TT
> >   - the map process is very slow
> >   - the progress bar is not working properly
> >
> > So, the map tasks are running in parallel (hadoop works :)) but I don't
> > understand why the progress of each map task stays at 0.
> >
> >
> >
> >
> >
> >
> > On Thu, Dec 6, 2012 at 3:48 PM, Harsh J <harsh@cloudera.com> wrote:
> >>
> >> I tend to agree with Jean-Marc's observation. If your job client logs
> >> a "LocalJobRunner" line at any point, then that is most definitely your
> >> problem.
> >>
> >> Otherwise, if you feel you are facing a scheduling problem, then it
> >> is most likely your scheduler configuration. For example,
> >> FairScheduler has a <maxMaps> element for its pools that you can
> >> set to control the maximum parallel use of slots for jobs using that pool,
> >> etc.
> >>
> >> On Thu, Dec 6, 2012 at 8:10 PM, x6i4uybz labs <x6i4uyzbz.labs@gmail.com>
> >> wrote:
> >> > Hello,
> >> >
> >> > The job isn't running in local mode. In fact, I think I just have a
> >> > problem with the map task progress.
> >> > The counters of each map task update correctly during job execution,
> >> > whereas the progress of each map task stays at 0%.
> >> >
> >> >
> >> >
> >> > On Thu, Dec 6, 2012 at 1:34 PM, Jean-Marc Spaggiari
> >> > <jean-marc@spaggiari.org> wrote:
> >> >>
> >> >> Hi,
> >> >>
> >> >> Have you configured mapred-site.xml to tell where the job tracker
> >> >> is? If not, your job is running in the local job runner, which runs
> >> >> the tasks one by one.
> >> >>
> >> >> JM
> >> >>
> >> >> PS: I faced the same issue a few weeks ago and got the exact same
> >> >> behaviour. This (above) solved the issue.
> >> >>
> >> >> 2012/12/6, x6i4uybz labs <x6i4uyzbz.labs@gmail.com>:
> >> >> > Sorry,
> >> >> >
> >> >> > I wrote an M/R job to process several gz files (about 2000). I have a
> >> >> > cluster with 80 map slots.
> >> >> > The JT instantiates one map per gz file (not splittable, that's OK).
> >> >> >
> >> >> > The first 80 maps spawn. But after the "initializing" state, it seems
> >> >> > there is only one map running. And when this map finishes, another one
> >> >> > starts (not 80 maps in parallel) and another is assigned to the empty
> >> >> > slot.
> >> >> >
> >> >> > I've also noticed that the first maps last more than one hour and the
> >> >> > last maps 50 seconds.
> >> >> > Each gz file is between 10 MB and 100 MB.
> >> >> >
> >> >> > I don't understand the behavior.
> >> >> > I will launch the job again to see if I have the same issue.
> >> >> >
> >> >> > thanks, gpo
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> > On Wed, Dec 5, 2012 at 6:33 PM, Harsh J <harsh@cloudera.com> wrote:
> >> >> >
> >> >> >> Your problem isn't clear from your description. Can you please
> >> >> >> rephrase it in terms of what you are expecting vs. what you are
> >> >> >> observing?
> >> >> >>
> >> >> >> Also note that Gzip files are not splittable by nature of their
> >> >> >> codec algorithm, and hence a TextInputFormat over plain/regular Gzip
> >> >> >> files would end up spawning one mapper per file, each processing one
> >> >> >> whole Gzip file, instead of multiple mappers per file.
> >> >> >>
> >> >> >> On Wed, Dec 5, 2012 at 9:32 PM, x6i4uybz labs
> >> >> >> <x6i4uyzbz.labs@gmail.com>
> >> >> >> wrote:
> >> >> >> > Hi everybody,
> >> >> >> >
> >> >> >> > I have an M/R job which does a bulk import to HBase.
> >> >> >> > I have to process many gzip files (2800 x ~100 MB).
> >> >> >> >
> >> >> >> > I don't understand why my job instantiates 80 maps but runs each
> >> >> >> > map sequentially, as if there were only one big gz file.
> >> >> >> >
> >> >> >> > Is there a problem in my driver? Or maybe I'm missing something.
> >> >> >> > I use "FileInputFormat.addInputPath(job, new Path(args[0]))", where
> >> >> >> > args[0] is a directory.
> >> >> >> >
> >> >> >> > Can you help me, please?
> >> >> >> >
> >> >> >> > Thanks, Guillaume
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> Harsh J
> >> >> >>
> >> >> >
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Harsh J
> >
> >
>
>
>
> --
> Harsh J
>
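
For the archive, the local-mode check suggested in the quoted messages above
can be sketched roughly as follows. This assumes a Hadoop 1.x (MRv1) setup,
where the mapred.job.tracker property defaults to "local" and that value
selects the LocalJobRunner. The class JobTrackerSettingCheck, its methods and
the host name jobtracker.example.com are hypothetical and only parse an XML
string with the JDK; this is not how Hadoop itself loads its configuration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of Hadoop.
class JobTrackerSettingCheck {

    /** Returns the value of the named property, or null if it is not set. */
    static String propertyValue(String siteXml, String name) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(siteXml.getBytes(StandardCharsets.UTF_8)));
        NodeList props = doc.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element p = (Element) props.item(i);
            NodeList names = p.getElementsByTagName("name");
            NodeList values = p.getElementsByTagName("value");
            if (names.getLength() > 0 && values.getLength() > 0
                    && name.equals(names.item(0).getTextContent().trim())) {
                return values.item(0).getTextContent().trim();
            }
        }
        return null;
    }

    /**
     * Assumption: with MRv1, a missing mapred.job.tracker or the value
     * "local" means the job is run by the local job runner.
     */
    static boolean wouldRunLocally(String siteXml) throws Exception {
        String jobTracker = propertyValue(siteXml, "mapred.job.tracker");
        return jobTracker == null || jobTracker.equals("local");
    }

    public static void main(String[] args) throws Exception {
        String unconfigured = "<configuration></configuration>";
        String configured =
                "<configuration>"
              + "<property>"
              + "<name>mapred.job.tracker</name>"
              + "<value>jobtracker.example.com:8021</value>" // hypothetical host
              + "</property>"
              + "</configuration>";
        System.out.println("unconfigured runs locally: " + wouldRunLocally(unconfigured));
        System.out.println("configured runs locally:   " + wouldRunLocally(configured));
    }
}
```

In practice, searching the job client output for "LocalJobRunner", as
suggested in the thread, is the quicker check. The sketch only shows which
setting is behind it.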

--14dae9340ef1d27bdf04d031ee44--