Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: domain of x6i4uyzbz.labs@gmail.com
 designates 209.85.210.176 as permitted sender)
MIME-Version: 1.0
Sender: gpolaert@gmail.com
In-Reply-To: 
 <CAOcnVr0-dvOiG1kna1dH_zoK2dd+5EXgyeTH_8=ya9mDMeXjQQ@mail.gmail.com>
References: 
 <CAPPeW7UaTC0JXz+8thrhE9SbV-HCC9qVvrXY3jNPKEUAyLwyww@mail.gmail.com>
	<CAOcnVr0-dvOiG1kna1dH_zoK2dd+5EXgyeTH_8=ya9mDMeXjQQ@mail.gmail.com>
Date: Thu, 6 Dec 2012 10:57:23 +0100
Message-ID: 
 <CAPPeW7UOVfyzg92zZm8duT=WyX1soNDh=WpskcnVpKwHc1GyRg@mail.gmail.com>
Subject: Re: M/R, Strange behavior with multiple Gzip files
From: x6i4uybz labs <x6i4uyzbz.labs@gmail.com>
To: user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=20cf30334b5f2960b604d02c20d6

--20cf30334b5f2960b604d02c20d6
Content-Type: text/plain; charset=ISO-8859-1

Sorry,

I wrote an M/R job to process several gz files (about 2000). My cluster has
80 map slots.
The JT instantiates one map per gz file (gzip is not splittable, so that's OK).
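
To illustrate why one map per gz file is expected, here is a minimal sketch
using only java.util.zip (no Hadoop involved). It suggests that a plain gzip
stream can be decoded from its first byte but not from an offset in the
middle, which is why a single mapper has to read each file from the start.
The class name, the method names and the sample data are made up for this
example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of Hadoop.
final class GzipSplitDemo {

    // Build some repetitive sample text (illustrative data only).
    static String sampleText() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 5000; i++) {
            sb.append("row-").append(i).append('\n');
        }
        return sb.toString();
    }

    // Compress the text into an in-memory gzip stream.
    static byte[] gzip(String text) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Pick an offset near the middle whose first byte is not 0x1f,
    // the first byte of the gzip magic number.
    static int midOffset(byte[] gz) {
        int off = gz.length / 2;
        while (off < gz.length - 1 && gz[off] == 0x1f) {
            off++;
        }
        return off;
    }

    // Try to decompress starting at the given offset; true if it succeeds.
    static boolean canDecodeFrom(byte[] gz, int offset) {
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(gz, offset, gz.length - offset))) {
            byte[] chunk = new byte[4096];
            while (in.read(chunk) != -1) {
                // drain the stream; the content itself is not needed here
            }
            return true;
        } catch (IOException e) {
            // No gzip header at this offset, so decoding cannot begin here.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] gz = gzip(sampleText());
        System.out.println("decode from start:  " + canDecodeFrom(gz, 0));
        System.out.println("decode from middle: " + canDecodeFrom(gz, midOffset(gz)));
    }
}
```

This only covers the file format side. It says nothing about why the maps
would run one after another on the cluster.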

The first 80 maps spawn. But after the "initializing" state, it seems that
only one map is actually running. When this map finishes, another one starts
(so not 80 maps in parallel), and a new map is assigned to the freed slot.

I've also noticed that the first maps take more than one hour each, while
the last maps take about 50 seconds.
Each gz file is between 10 MB and 100 MB.

I don't understand this behavior.
I will launch the job again to see whether I get the same issue.

thanks, gpo


On Wed, Dec 5, 2012 at 6:33 PM, Harsh J <harsh@cloudera.com> wrote:

> Your problem isn't clear in your description - can you please
> rephrase/redefine in terms of what you are expecting vs. what you are
> observing.
>
> Also note that Gzip files are not splittable by nature of their codec
> algorithm, and hence a TextInputFormat over plain/regular Gzip files
> would end up spawning and/or processing one whole Gzip file via one
> mapper, instead of multiple mappers per file.
>
> On Wed, Dec 5, 2012 at 9:32 PM, x6i4uybz labs <x6i4uyzbz.labs@gmail.com>
> wrote:
> > Hi everybody,
> >
> > I have an M/R job which does a bulk import to HBase.
> > I have to process many gzip files (2800 x ~100 MB).
> >
> > I don't understand why my job instantiates 80 maps but runs each map
> > sequentially, as if there were only one big gz file.
> >
> > Is there a problem in my driver? Or maybe I am missing something.
> > I use "FileInputFormat.addInputPath(job, new Path(args[0]))" where
> > args[0] is a directory.
> >
> > Can you help me, please ?
> >
> > Thanks, Guillaume
>
>
>
> --
> Harsh J
>

--20cf30334b5f2960b604d02c20d6--