From: Krishna Rao <krishnanjrao@gmail.com>
Date: Thu, 27 Mar 2014 09:59:44 +0000
Subject: Re: Job froze for hours because of an unresponsive disk on one of the task trackers
To: user@hadoop.apache.org

I noticed that, but none of the tasks ended up being re-submitted! And all 3
of those tasks failed on the same node. All we know is that the disk on that
node became unresponsive.

On 27 March 2014 09:33, Dieter De Witte <drdwitte@gmail.com> wrote:

> The ids of the tasks are different so the node got killed after failing on
> 3 different(!) reduce tasks. The reduce task 48 will probably have been
> resubmitted to another node.
>
> 2014-03-27 10:22 GMT+01:00 Krishna Rao <krishnanjrao@gmail.com>:
>
>> Hi,
>>
>> we have a daily Hive script that usually takes a few hours to run. The
>> other day I noticed one of the jobs was taking in excess of a few hours.
>> Digging into it I saw that there were 3 attempts to launch a task on a
>> single node:
>>
>> Task Id                           Start Time  Finish Time  Error
>> task_201312241250_46714_r_000048                           Error launching task
>> task_201312241250_46714_r_000049                           Error launching task
>> task_201312241250_46714_r_000050                           Error launching task
>>
>> I later found out that this node had a dodgy/unresponsive disk (still
>> being tested right now).
>>
>> We've seen tasks fail in the past, but they were re-submitted to another
>> node and succeeded. So, shouldn't this task have been kicked off on
>> another node after the first failure? Is there anything I could be
>> missing in terms of configuration that should be set?
>>
>> We're using CDH4.4.0.
>>
>> Cheers,
>>
>> Krishna
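
The point about the task ids being different can be checked mechanically. Below is a minimal sketch in Python, not part of the original thread, that splits the three quoted ids into their fields. It assumes the usual MRv1 layout task_<jobtracker start time>_<job number>_<m|r>_<task number>; the names TASK_ID, TaskId and parse_task_id are hypothetical helpers made up for this illustration.

```python
import re
from collections import namedtuple

# Assumed layout, inferred from the ids quoted in the thread:
# task_<jobtracker start time>_<job number>_<m|r>_<task number>
TASK_ID = re.compile(r"^task_(\d+)_(\d+)_([mr])_(\d+)$")

# Hypothetical record type for the parsed fields.
TaskId = namedtuple("TaskId", "cluster job kind number")


def parse_task_id(text):
    """Split a task id string into its fields, or raise ValueError."""
    match = TASK_ID.match(text)
    if match is None:
        raise ValueError("not a task id: %r" % text)
    cluster, job, kind, number = match.groups()
    return TaskId(cluster, int(job), "reduce" if kind == "r" else "map", int(number))


# The three ids from the table in the original message.
failed = [
    "task_201312241250_46714_r_000048",
    "task_201312241250_46714_r_000049",
    "task_201312241250_46714_r_000050",
]

parsed = [parse_task_id(t) for t in failed]

# All three ids belong to one job if cluster and job number agree.
same_job = len({(p.cluster, p.job) for p in parsed}) == 1
# They are different tasks if every task number is unique.
distinct_tasks = len({p.number for p in parsed}) == len(parsed)

print(same_job, distinct_tasks, [p.number for p in parsed])
# prints: True True [48, 49, 50]
```

Under that assumed layout, the table shows three different reduce tasks (48, 49 and 50) of the same job each failing to launch once on that node, rather than three attempts of a single task. Whether each of them was later retried elsewhere would have to be read from the per-attempt records, which are not included in the thread.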