Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: domain of x6i4uyzbz.labs@gmail.com
 designates 209.85.223.176 as permitted sender)
MIME-Version: 1.0
Sender: gpolaert@gmail.com
In-Reply-To: 
 <CAPQV63X9tRMc+BZ5Rbkn0MSAu_e=qL7x-_nzyRPiHBec2sMrNQ@mail.gmail.com>
References: 
 <CAPPeW7UaTC0JXz+8thrhE9SbV-HCC9qVvrXY3jNPKEUAyLwyww@mail.gmail.com>
	<CAOcnVr0-dvOiG1kna1dH_zoK2dd+5EXgyeTH_8=ya9mDMeXjQQ@mail.gmail.com>
	<CAPPeW7UOVfyzg92zZm8duT=WyX1soNDh=WpskcnVpKwHc1GyRg@mail.gmail.com>
	<CAPQV63X9tRMc+BZ5Rbkn0MSAu_e=qL7x-_nzyRPiHBec2sMrNQ@mail.gmail.com>
Date: Thu, 6 Dec 2012 15:40:51 +0100
Message-ID: 
 <CAPPeW7UEOVewGVb9AYq_ibVe+D9ZOy6mD0F_Fsn=u3fpYFhe2A@mail.gmail.com>
Subject: Re: M/R, Strange behavior with multiple Gzip files
From: x6i4uybz labs <x6i4uyzbz.labs@gmail.com>
To: user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=20cf30334b5fee641904d0301573

--20cf30334b5fee641904d0301573
Content-Type: text/plain; charset=ISO-8859-1

Hello,

The job isn't running in local mode. In fact, I think the only problem is
with the map task progress reporting.
The counters of each map task update correctly during job execution, but the
progress of each map task stays at 0%.
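
As an illustration only, here is a minimal sketch in plain Java (not Hadoop
code; the class, method, and field names are hypothetical) of how a record
count and a progress fraction over a gzip stream are two separately tracked
values. It assumes progress is measured as compressed bytes consumed divided
by the compressed file length. Whether the Hadoop record reader computes
progress this way for gzip input is not confirmed here.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipProgressSketch {

    /** Counts the compressed bytes pulled from the wrapped stream. */
    static final class CountingInputStream extends FilterInputStream {
        private long count;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b >= 0) {
                count++;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) {
                count += n;
            }
            return n;
        }

        long getCount() {
            return count;
        }
    }

    /** Holds the two independent measurements. */
    public static final class Result {
        public final long records;    // lines read (the "counter")
        public final double progress; // fraction of compressed bytes consumed

        Result(long records, double progress) {
            this.records = records;
            this.progress = progress;
        }
    }

    /** Helper for the demo: compresses a string into an in-memory gzip blob. */
    public static byte[] gzip(String text) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    /** Reads every line of the gzip blob, tracking records and progress separately. */
    public static Result readAll(byte[] gz) {
        CountingInputStream counting =
                new CountingInputStream(new ByteArrayInputStream(gz));
        long records = 0;
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(counting), StandardCharsets.UTF_8))) {
            while (reader.readLine() != null) {
                records++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Progress is derived from the compressed side only; it is capped at 1.0.
        double progress = Math.min(1.0, (double) counting.getCount() / gz.length);
        return new Result(records, progress);
    }

    public static void main(String[] args) {
        Result r = readAll(gzip("row1\nrow2\nrow3\n"));
        System.out.println("records=" + r.records + " progress=" + r.progress);
    }
}
```

In this sketch the record count never feeds into the progress value, so one
can advance while the other is reported incorrectly.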


On Thu, Dec 6, 2012 at 1:34 PM, Jean-Marc Spaggiari <jean-marc@spaggiari.org> wrote:

> Hi,
>
> Have you configured mapred-site.xml to tell the job where the job tracker
> is? If not, your job is running on the local jobtracker, running the
> tasks one by one.
>
> JM
>
> PS: I faced the same issue a few weeks ago and got the exact same
> behaviour. This (above) solved the issue.
>
> 2012/12/6, x6i4uybz labs <x6i4uyzbz.labs@gmail.com>:
> > Sorry,
> >
> > I wrote an M/R job to process several gz files (about 2000). I have a
> > cluster with 80 map slots.
> > The JT instantiates one map per gz file (not splittable, which is OK).
> >
> > The first 80 maps spawn. But after the "initializing" state, it seems that
> > only one map is running. When this map finishes, another one starts (not
> > 80 maps in parallel) and another is assigned to the empty slot.
> >
> > I've also noticed that the first maps last more than one hour and the last
> > maps 50 seconds.
> > Each gz file is between 10 MB and 100 MB.
> >
> > I don't understand the behavior.
> > I will launch the job again to see if I have the same issue.
> >
> > thanks, gpo
> >
> > On Wed, Dec 5, 2012 at 6:33 PM, Harsh J <harsh@cloudera.com> wrote:
> >
> >> Your problem isn't clear in your description - can you please
> >> rephrase/redefine in terms of what you are expecting vs. what you are
> >> observing.
> >>
> >> Also note that Gzip files are not splittable by nature of their codec
> >> algorithm, and hence a TextInputFormat over plain/regular Gzip files
> >> would end up spawning and/or processing one whole Gzip file via one
> >> mapper, instead of multiple mappers per file.
> >>
> >> On Wed, Dec 5, 2012 at 9:32 PM, x6i4uybz labs <x6i4uyzbz.labs@gmail.com> wrote:
> >> > Hi everybody,
> >> >
> >> > I have an M/R job which does a bulk import to HBase.
> >> > I have to process many gzip files (2800 x ~100 MB).
> >> >
> >> > I don't understand why my job instantiates 80 maps but runs each map
> >> > sequentially, as if there were only one big gz file.
> >> >
> >> > Is there a problem in my driver? Or maybe I missed something.
> >> > I use "FileInputFormat.addInputPath(job, new Path(args[0]))" where
> >> > args[0] is a directory.
> >> >
> >> > Can you help me, please?
> >> >
> >> > Thanks, Guillaume
> >>
> >>
> >>
> >> --
> >> Harsh J
> >>
> >
>

--20cf30334b5fee641904d0301573--