From: YouPeng Yang
To: user@hadoop.apache.org
Date: Sat, 2 Mar 2013 10:36:52 +0800
Subject: Re: map stucks at 99.99%

Hi Patai,

I found a similar explanation in the Google MapReduce paper:
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/zh-CN//archive/mapreduce-osdi04.pdf

Please refer to chapter 3.6, Backup Tasks.

Hope this helps.

Regards

2013/3/1 Matt Davies <matt@mattdavies.net>:
> I've seen this before when the input data stream changes suddenly and does
> not lend itself to parallelization, such as counting the number of tuples
> in a bag.
>
> One thing that may be interesting is comparing the job counters from a
> previous run against this job that just completed. Do they differ? Is there
> a particular mapper whose counts are way out of whack?
>
> Has someone tweaked the production job in one way or another?
>
> On Thu, Feb 28, 2013 at 1:28 PM, Patai Sangbutsarakum
> <silvianhadoop@gmail.com> wrote:
>> > What type of CPU is on the box? Load average seems pretty high for an
>> > 8-core box.
>> Xeon 3.07GHz, 24 cores.
>>
>> > Do you have ganglia on these boxes? Is the load average always so high?
>> > What's the memory usage for the task and overall on the box?
>> From `top -p <pid>` of the task:
>> CPU 143.2%, MEM 1.7%
>> So memory is not drying up; the CPU is pretty pegged.
>>
>> > How long has the map task been running in that stuck state?
>> At least 2 hours.
>>
>> It finally just finished after hours; it took double the usual time
>> today. T_T
>>
>> On Thu, Feb 28, 2013 at 1:18 PM, Viral Bajaria
>> <viral.bajaria@gmail.com> wrote:
>> > What type of CPU is on the box? Load average seems pretty high for an
>> > 8-core box. Do you have ganglia on these boxes? Is the load average
>> > always so high? What's the memory usage for the task and overall on
>> > the box?
>> >
>> > How long has the map task been running in that stuck state? If it's
>> > been a few minutes, I am surprised that the JT didn't try to run it on
>> > another node. Or have you switched off speculative execution?
>> >
>> > Sorry, too many questions!
>> >
>> > You can try jstack and jmap. That will at least tell you what's
>> > getting blocked.
>> >
>> > On Thu, Feb 28, 2013 at 1:04 PM, Patai Sangbutsarakum
>> > <silvianhadoop@gmail.com> wrote:
>> >> - Check the box on which the task is running: is it under heavy load?
>> >> Is there a high amount of I/O wait?
>> >> CPU: very warm, load average 47.47, 48.56, 49.00.
>> >> I/O: chill, 0.1x% iowait, less than 20 tps, rarely up to 100 tps, on
>> >> a 10-disk JBOD.
>> >>
>> >> - You could check the task logs and see if they say anything about
>> >> what is going wrong?
>> >> I would say no; pretty much all of them are INFO.
>> >>
>> >> - Did the task get pre-empted to other task trackers? If yes, is it
>> >> stuck at the same spot on those?
>> >> Nope.
>> >>
>> >> - What kind of work are you doing in the mapper? Just reading from
>> >> HDFS and computing something, or reading/writing from HBase?
>> >> HDFS + compute, R/W. Absolutely no HBase.
>> >>
>> >> Would jstack or jmap be any use?
>> >>
>> >> On Thu, Feb 28, 2013 at 12:25 PM, Viral Bajaria
>> >> <viral.bajaria@gmail.com> wrote:
>> >> > You could start off doing the following:
>> >> >
>> >> > - Check the box on which the task is running: is it under heavy
>> >> >   load? Is there a high amount of I/O wait?
>> >> > - You could check the task logs and see if they say anything about
>> >> >   what is going wrong.
>> >> > - Did the task get pre-empted to other task trackers? If yes, is it
>> >> >   stuck at the same spot on those?
>> >> > - What kind of work are you doing in the mapper? Just reading from
>> >> >   HDFS and computing something, or reading/writing from HBase?
>> >> >
>> >> > Thanks,
>> >> > Viral
>> >> >
>> >> > On Thu, Feb 28, 2013 at 12:06 PM, Patai Sangbutsarakum
>> >> > <silvianhadoop@gmail.com> wrote:
>> >> >> Hadoopers!!
>> >> >>
>> >> >> Need input from you guys. I am looking at a critical job in
>> >> >> production; it is stuck at 99.99% in the map phase for much longer
>> >> >> than it used to be.
>> >> >>
>> >> >> How can I debug why those maps are not passing through, even
>> >> >> though the tasks and task attempts report 100% progress but there
>> >> >> is no finish time?
>> >> >>
>> >> >> Please suggest,
>> >> >> Patai
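[Archive note: the "Backup Tasks" mechanism referenced above (section 3.6 of the MapReduce paper, surfaced in Hadoop as speculative execution) can be illustrated with a toy sketch. This is not Hadoop code, and all the times below are invented for illustration only.]

```python
# Toy model: a job finishes when its slowest task finishes, so a single
# straggler dominates completion time. A backup (duplicate) attempt lets
# the job take whichever copy of the straggler finishes first.
task_times = [10, 11, 10, 12, 240]    # seconds; the 240s task is the straggler

job_without_backup = max(task_times)  # the whole job waits on the straggler

backup_attempt = 15                   # hypothetical duplicate of the straggler
effective_last = min(task_times[-1], backup_attempt)
job_with_backup = max(task_times[:-1] + [effective_last])

print(job_without_backup, job_with_backup)  # prints: 240 15
```

This is why Viral's question about speculative execution being switched off matters: with it on, the JobTracker would normally have scheduled a second attempt of the 99.99%-stuck map on another node.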
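[Archive note: a back-of-the-envelope check of the load numbers debated in the thread. Load average approximates the count of runnable (and uninterruptibly sleeping) threads, so saturation is judged per core: ~48 would be alarming on the 8-core box Viral assumed, but on the 24-core box Patai describes it is roughly two runnable threads per core, i.e. busy but not pathological. The numbers are the ones reported in the thread.]

```python
# Load averages reported by Patai, and the box's actual core count.
load_1m, load_5m, load_15m = 47.47, 48.56, 49.00
cores = 24

# Runnable threads per core: the figure that actually indicates saturation.
per_core = load_1m / cores
print(round(per_core, 2))  # prints: 1.98
```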