From: Viral Bajaria <viral.bajaria@gmail.com>
To: user@hadoop.apache.org
Date: Thu, 28 Feb 2013 13:18:02 -0800
Subject: Re: map stucks at 99.99%

What type of CPU is on the box ? The load average seems pretty high for an 8-core box. Do you have ganglia on these boxes ? Is the load average always so high ? What's the memory usage for the task and overall on the box ?
How long has the map task been running in that stuck state ? If it's been a few minutes, I am surprised that the JT didn't try to run it on another node, or have you switched off speculative execution ?

Sorry too many questions !!
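
To put the load-average question in numbers, here is a minimal sketch in Python. It only divides the 1-minute load average by the core count, using the figures quoted in this thread (47.47 on an 8-core box). The function name `load_per_core` is made up for illustration, and reading a ratio well above 1 as "more runnable work than cores" is a rule of thumb, not a diagnosis.

```python
import os


def load_per_core(load_average, cores):
    """Return the load average divided by the number of cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_average / cores


# Figures quoted in this thread: 1-minute load 47.47 on an 8-core box.
thread_ratio = load_per_core(47.47, 8)
print(f"load per core from the thread: {thread_ratio:.2f}")

# On the affected node itself (Unix only), the live figure can be read directly.
if hasattr(os, "getloadavg"):
    try:
        one_minute, _, _ = os.getloadavg()
        live_ratio = load_per_core(one_minute, os.cpu_count() or 1)
        print(f"live load per core: {live_ratio:.2f}")
    except OSError:
        print("load average is not available on this system")
```

With almost no iowait reported, a ratio this far above 1 would point at CPU contention rather than disk, which is why the questions above focus on CPU and memory.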

You can try jstack and jmap. That will at least tell you what is getting blocked.
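
As a rough illustration of what to look for in a jstack dump, the sketch below counts threads by state and lists the names of BLOCKED threads. It assumes the dump was saved to a text file first (for example with `jstack <pid> > dump.txt`, where the file name is arbitrary) and relies only on the `java.lang.Thread.State:` line that jstack prints under each Java thread header. The function name `summarize_thread_dump` is hypothetical.

```python
import re
from collections import Counter

# Matches the state line that jstack prints under each Java thread header.
STATE_RE = re.compile(r"java\.lang\.Thread\.State:\s+(\w+)")
# Matches a thread header line, which starts with the quoted thread name.
NAME_RE = re.compile(r'^"([^"]+)"')


def summarize_thread_dump(dump_text):
    """Return (state counts, names of BLOCKED threads) for a jstack dump."""
    states = Counter()
    blocked = []
    current_name = None
    for line in dump_text.splitlines():
        name_match = NAME_RE.match(line)
        if name_match:
            # Remember which thread the following state line belongs to.
            current_name = name_match.group(1)
            continue
        state_match = STATE_RE.search(line)
        if state_match:
            state = state_match.group(1)
            states[state] += 1
            if state == "BLOCKED" and current_name is not None:
                blocked.append(current_name)
    return states, blocked
```

Taking two or three dumps a few seconds apart and comparing the summaries helps separate a thread that is genuinely stuck from one that just happened to be waiting at that moment.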

On Thu, Feb 28, 2013 at 1:04 PM, Patai Sangbutsarakum <silvianhadoop@gmail.com> wrote:
- Check the box on which the task is running, is it under heavy load ?
Is there high amount of I/O wait ?
CPU is very warm, load average: 47.47, 48.56, 49.00
I/O is quiet: 0.1x% iowait, less than 20 tps, rarely up to
100 tps, on 10 disks in JBOD.


- You could check the task logs and see if they say anything about
what is going wrong ?
I would say no. Pretty much all of the entries are INFO.

- Did the task get pre-empted to other task trackers ? If yes, is it
stuck at the same spot on those ?
Nope.

- What kind of work are you doing in the mapper ? Just reading from
HDFS and compute something or reading/writing from HBase ?
HDFS + compute, R/W
Absolutely no HBase.

Would jstack or jmap be of any use ?


> - You could check the task logs and see if they say anything about what is
> going wrong ?
> - Did the task get pre-empted to other task trackers ? If yes, is it stuck
> at the same spot on those ?
> - What kind of work are you doing in the mapper ? Just reading from HDFS and
> compute something or reading/writing from HBase ?

On Thu, Feb 28, 2013 at 12:25 PM, Viral Bajaria <viral.bajaria@gmail.com> wrote:
> You could start off doing the following:
>
> - Check the box on which the task is running, is it under heavy load ? Is
> there high amount of I/O wait ?
> - You could check the task logs and see if they say anything about what is
> going wrong ?
> - Did the task get pre-empted to other task trackers ? If yes, is it stuck
> at the same spot on those ?
> - What kind of work are you doing in the mapper ? Just reading from HDFS and
> compute something or reading/writing from HBase ?
>
> Thanks,
> Viral
>
> On Thu, Feb 28, 2013 at 12:06 PM, Patai Sangbutsarakum
> <silvianhadoop@gmail.com> wrote:
>>
>> Hadoopers!!
>>
>> Need input from you guys,
>> I am looking at a critical job in production. It is stuck at 99.99% in
>> the map phase for much longer than it used to be.
>>
>> How can I debug what is going on with those maps and why they do not
>> pass through? The tasks and task attempts say 100% progress, but there
>> is no finish time.
>>
>> Please suggest
>> Patai
>
>
