Subject: Re: how to enhance job start up speed?
From: Bertrand Dechoux
To: user@hadoop.apache.org
Date: Mon, 13 Aug 2012 15:57:47 +0200

I am not sure I understand, and I guess I am not the only one.

1) What is a "worker" in your context? Only the logic inside your Mapper, or something else?
2) You should clarify your cases. You seem to have two cases, but both are stated as overhead, so I am assuming there is a baseline? Hadoop vs. sequential, so sequential means without Hadoop?
3) What is the size of the file?

Bertrand

On Mon, Aug 13, 2012 at 1:51 PM, Matthias Kricke <matthias.mk.kricke@gmail.com> wrote:

> Hello all,
>
> I'm using CDH3u3.
> If I want to process one file, set to non-splittable, Hadoop starts one
> Mapper and no Reducer (that's fine for this test scenario). The Mapper
> goes through a configuration step where some variables for the worker
> inside the Mapper are initialized.
> The Mapper then gives me K,V pairs, which are lines of an input file. I
> process the V with the worker.
>
> When I compare the runtime of Hadoop to the runtime of the same process
> run sequentially, I get:
>
> worker time --> same in both cases
>
> case: mapper --> overhead of ~32% over the worker process (same for
> bigger chunk sizes)
> case: sequential --> overhead of ~15% over the worker process
>
> It shouldn't be that much slower: because the file is non-splittable,
> the Mapper will be executed where the data is stored by HDFS, won't it?
> Where did those 17% go? How can I reduce this? Does Hadoop need the
> whole time for reading or streaming the data out of HDFS?
>
> I would appreciate your help,
>
> Greetings
> mk

--
Bertrand Dechoux
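Regarding the overhead question: much of Hadoop's per-job cost (JVM spawn, task scheduling, opening the HDFS stream) is roughly fixed per task, while the worker's cost grows with input size, so the overhead's share of total runtime shrinks as the worker does more work. A minimal sketch of that amortization, with purely illustrative timings (the 20-second startup figure is an assumption for the sake of the example, not a measured Hadoop value):

```python
# Toy model: fixed job-startup overhead vs. per-record worker time.
# The share of runtime spent outside the worker falls as the input
# (and thus worker time) grows, which is why a small single-split
# file shows a large relative overhead.

def overhead_fraction(fixed_startup_s: float, worker_s: float) -> float:
    """Fraction of total runtime spent outside the worker."""
    total = fixed_startup_s + worker_s
    return fixed_startup_s / total

# Same hypothetical 20 s fixed cost against growing worker time:
for worker_s in (60.0, 600.0, 6000.0):
    frac = overhead_fraction(20.0, worker_s)
    print(f"worker={worker_s:7.0f}s  overhead={frac:.1%}")
```

On a tiny test file the fixed cost dominates, so a ~32% overhead is not surprising; rerunning the comparison on a much larger input (or reusing JVMs across tasks) is the usual way to see whether the gap is startup cost or genuine per-byte streaming cost.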