From: Amit Kabra
Date: Sun, 20 Apr 2014 15:27:49 +0530
Subject: All datanodes are bad. Aborting ...
To: user@hadoop.apache.org

Hello,

I am facing an issue where, while running a MapReduce job (terasort), I see a few tasks failing with the error "All datanodes 10.230.229.76:50010 are bad." The job nevertheless finishes successfully, since the failed tasks are spawned again, but I have seen these tasks fail multiple times.

Test Setup:
=========
Terasort for 300 GB, number of reducers: 100, number of maps: 2400.
Cluster is running fine (before / after the error).
12-datanode setup (only ZooKeeper / HDFS / MapReduce running).
Map/Reduce container memory: 5 GB.
Only a few full GCs (2-3), each with less than 0.02 sec pause time.

Debugging:
========
Error on console:

14/04/19 10:17:09 [main] INFO mapreduce.Job(1425): Task Id : attempt_1397901041097_0001_r_000025_0, Status : FAILED
Error: java.io.IOException: All datanodes 10.230.229.76:50010 are bad. Aborting…
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:960)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:780)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:449)

Node Manager / Resource Manager show no errors.
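One way to check whether the repeated failures cluster on a single datanode is to tally the console errors per datanode address. The following is only a minimal sketch, assuming the job client output has been captured as a list of text lines in the format shown above; the function name `tally_bad_datanodes` and the regular expressions are hypothetical helpers written for this illustration, not part of Hadoop.

```python
import re
from collections import Counter

# Matches the job client's line for a failed task attempt (format taken from
# the console output quoted above) and captures the attempt ID.
ATTEMPT_RE = re.compile(
    r"Task Id : (attempt_\d+_\d+_[mr]_\d+_\d+), Status : FAILED"
)
# Matches the HDFS client error and captures the datanode address(es).
BAD_DN_RE = re.compile(r"All datanodes (\S+) are bad")


def tally_bad_datanodes(lines):
    """Pair each failed attempt with the datanode named in its error.

    Returns a Counter mapping datanode address -> number of failures, and a
    list of (attempt_id, datanode) pairs in the order they were seen.
    """
    counts = Counter()
    pairs = []
    current_attempt = None
    for line in lines:
        attempt = ATTEMPT_RE.search(line)
        if attempt:
            # Remember the attempt; its error is expected on a later line.
            current_attempt = attempt.group(1)
            continue
        bad = BAD_DN_RE.search(line)
        if bad and current_attempt is not None:
            counts[bad.group(1)] += 1
            pairs.append((current_attempt, bad.group(1)))
            current_attempt = None
    return counts, pairs


# Usage with the two console lines quoted above.
sample = [
    "14/04/19 10:17:09 [main] INFO mapreduce.Job(1425): Task Id : "
    "attempt_1397901041097_0001_r_000025_0, Status : FAILED",
    "Error: java.io.IOException: All datanodes 10.230.229.76:50010 are bad. "
    "Aborting...",
]
counts, pairs = tally_bad_datanodes(sample)
for datanode, failures in counts.most_common():
    print(datanode, failures)
```

If one address dominates the tally, that datanode's own log for the same timestamps is probably the first place to look; if the failures are spread evenly, a cluster-wide cause seems more likely.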
Datanode / NameNode show the following error:

Though the job has finished, I can't access it, since the job page gives "Not Found: job_1397901041097_0001". On debugging, I found that this could be because of the following lines in the Resource Manager log, though I am not sure why they appear:

19-Apr-2014 10:26:12 [1086033982@qtp-802928030-12] INFO org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet[327] - sfdc is accessing unchecked http://blitzhbase02-mnds1-1-crd.eng.sfdc.net:19888/jobhistory/job/job_1397901041097_0001/mapreduce/job/job_1397901041097_0001 which is the app master GUI of application_1397901041097_0001 owned by sfdc
19-Apr-2014 10:32:18 [501790067@qtp-802928030-22] INFO org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet[327] - sfdc is accessing unchecked http://blitzhbase02-mnds1-1-crd.eng.sfdc.net:19888/jobhistory/job/job_1397901041097_0001/mapreduce/job/job_1397901041097_0001 which is the app master GUI of application_1397901041097_0001 owned by sfdc
19-Apr-2014 10:33:43 [Delegation Token Canceler] INFO org.apache.hadoop.hdfs.DFSClient[898] - Cancelling HDFS_DELEGATION_TOKEN token 235 for sfdc on ha-hdfs:crd-dev-blitzhbase02

Has anyone seen this before? Any input on this would be helpful.

Note: Sometimes I also see these errors, which seem to be a known issue. I restarted the cluster for this one.

Amit.