Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (nike.apache.org: domain of stransky.ja@gmail.com
 designates 209.85.215.51 as permitted sender)
MIME-Version: 1.0
Date: Thu, 11 Sep 2014 18:35:41 +0200
Message-ID: 
 <CAJOOh6GfzSnyg8-mV0E-YrV+HAj3Zs+5MEh_BPvLV9ubKAz+qA@mail.gmail.com>
Subject: task slowness
From: Jakub Stransky <stransky.ja@gmail.com>
To: user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=089e0160bbee5df91c0502ccc209

--089e0160bbee5df91c0502ccc209
Content-Type: text/plain; charset=UTF-8

Hello experienced Hadoop users,

I have a data pipeline consisting of two Java MR jobs coordinated by the
Oozie scheduler. Both of them process the same data, but the first one is
more than 10 times slower than the second. The job counters on the RM page
are not much help here. I have verified from our monitoring system that
there were no hardware constraints such as IO, CPU or network. In fact,
the tasks used just a fraction of the resources allocated to the given
container.

Is there a way to get some profiling statistics out of a Hadoop cluster
task? What are the best available tools, required settings, etc.?

I have read the job tuning part of Hadoop: The Definitive Guide, but I am
not sure that those settings are still valid for Hadoop 2.2.0.

Could someone point me to a good resource for this information, e.g. a
blog, manual or book? I am a bit confused about which settings refer to
Hadoop 1 and which are the settings for Hadoop 2 / MR2.

The dataset size is around 500MB compressed, and it is a map-only task.

Thanks for any experience shared
Jakub


--089e0160bbee5df91c0502ccc209--