From: YouPeng Yang
To: user@hadoop.apache.org
Date: Sat, 2 Mar 2013 10:36:52 +0800
Subject: Re: map stucks at 99.99%

Hi Patai,

I found a similar explanation in the Google MapReduce paper:
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/zh-CN//archive/mapreduce-osdi04.pdf

Please refer to chapter 3.6, Backup Tasks.

Hope this helps.

Regards

2013/3/1 Matt Davies <matt@mattdavies.net>:
> I've seen this before when the input data stream changes suddenly and does
> not lend itself to parallelization, such as counting the number of tuples
> in a bag.
>
> One thing that may be interesting is comparing the job counters from a
> previous run against this job that just completed. Do they differ? Is there
> a particular mapper whose counts are way out of whack?
>
> Has someone tweaked the production job in one way or another?
>
> On Thu, Feb 28, 2013 at 1:28 PM, Patai Sangbutsarakum
> <silvianhadoop@gmail.com> wrote:
>> > What type of CPU is on the box? Load average seems pretty high for an
>> > 8-core box.
>> Xeon 3.07GHz, 24 cores.
>>
>> > Do you have ganglia on these boxes? Is the load average always so high?
>> > What's the memory usage for the task and overall on the box?
>> From `top -p <pid>` of the task:
>> CPU 143.2%, MEM 1.7%
>> So memory is not drying up; the CPU is pretty pegged.
>>
>> > How long has the map task been running in that stuck state?
>> At least 2 hours.
>>
>> It finally just finished after hours; it took double the usual time
>> today. T_T
>>
>> On Thu, Feb 28, 2013 at 1:18 PM, Viral Bajaria
>> <viral.bajaria@gmail.com> wrote:
>> > What type of CPU is on the box? Load average seems pretty high for an
>> > 8-core box. Do you have ganglia on these boxes? Is the load average
>> > always so high? What's the memory usage for the task and overall on
>> > the box?
>> >
>> > How long has the map task been running in that stuck state? If it's
>> > been a few minutes, I am surprised that the JT didn't try to run it on
>> > another node. Or have you switched off speculative execution?
>> >
>> > Sorry, too many questions!
>> >
>> > You can try jstack and jmap. That will at least tell you what's
>> > getting blocked.
>> >
>> > On Thu, Feb 28, 2013 at 1:04 PM, Patai Sangbutsarakum
>> > <silvianhadoop@gmail.com> wrote:
>> >> - Check the box on which the task is running: is it under heavy load?
>> >> Is there a high amount of I/O wait?
>> >> CPU: very warm, load average 47.47, 48.56, 49.00.
>> >> I/O: chill, 0.1x% iowait, less than 20 tps, rarely up to 100 tps, on
>> >> a 10-disk JBOD.
>> >>
>> >> - You could check the task logs and see if they say anything about
>> >> what is going wrong?
>> >> I would say no; pretty much all of them are INFO.
>> >>
>> >> - Did the task get pre-empted to other task trackers? If yes, is it
>> >> stuck at the same spot on those?
>> >> Nope.
>> >>
>> >> - What kind of work are you doing in the mapper? Just reading from
>> >> HDFS and computing something, or reading/writing from HBase?
>> >> HDFS + compute, R/W. Absolutely no HBase.
>> >>
>> >> Would jstack or jmap be any use?
>> >>
>> >> On Thu, Feb 28, 2013 at 12:25 PM, Viral Bajaria
>> >> <viral.bajaria@gmail.com> wrote:
>> >> > You could start off doing the following:
>> >> >
>> >> > - Check the box on which the task is running: is it under heavy
>> >> >   load? Is there a high amount of I/O wait?
>> >> > - You could check the task logs and see if they say anything about
>> >> >   what is going wrong.
>> >> > - Did the task get pre-empted to other task trackers? If yes, is it
>> >> >   stuck at the same spot on those?
>> >> > - What kind of work are you doing in the mapper? Just reading from
>> >> >   HDFS and computing something, or reading/writing from HBase?
>> >> >
>> >> > Thanks,
>> >> > Viral
>> >> >
>> >> > On Thu, Feb 28, 2013 at 12:06 PM, Patai Sangbutsarakum
>> >> > <silvianhadoop@gmail.com> wrote:
>> >> >> Hadoopers!!
>> >> >>
>> >> >> Need input from you guys. I am looking at a critical job in
>> >> >> production; it is stuck at 99.99% in the map phase for much longer
>> >> >> than it used to be.
>> >> >>
>> >> >> How can I debug why those maps are not passing through, even
>> >> >> though the tasks and task attempts report 100% progress but there
>> >> >> is no finish time?
>> >> >>
>> >> >> Please suggest,
>> >> >> Patai
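[Archive note: the "Backup Tasks" mechanism referenced above (section 3.6 of the MapReduce paper, surfaced in Hadoop as speculative execution) can be illustrated with a toy sketch. This is not Hadoop code, and all the times below are invented for illustration only.]

```python
# Toy model: a job finishes when its slowest task finishes, so a single
# straggler dominates completion time. A backup (duplicate) attempt lets
# the job take whichever copy of the straggler finishes first.
task_times = [10, 11, 10, 12, 240]    # seconds; the 240s task is the straggler

job_without_backup = max(task_times)  # the whole job waits on the straggler

backup_attempt = 15                   # hypothetical duplicate of the straggler
effective_last = min(task_times[-1], backup_attempt)
job_with_backup = max(task_times[:-1] + [effective_last])

print(job_without_backup, job_with_backup)  # prints: 240 15
```

This is why Viral's question about speculative execution being switched off matters: with it on, the JobTracker would normally have scheduled a second attempt of the 99.99%-stuck map on another node.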
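[Archive note: a back-of-the-envelope check of the load numbers debated in the thread. Load average approximates the count of runnable (and uninterruptibly sleeping) threads, so saturation is judged per core: ~48 would be alarming on the 8-core box Viral assumed, but on the 24-core box Patai describes it is roughly two runnable threads per core, i.e. busy but not pathological. The numbers are the ones reported in the thread.]

```python
# Load averages reported by Patai, and the box's actual core count.
load_1m, load_5m, load_15m = 47.47, 48.56, 49.00
cores = 24

# Runnable threads per core: the figure that actually indicates saturation.
per_core = load_1m / cores
print(round(per_core, 2))  # prints: 1.98
```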