From: Ashwanth Kumar
Reply-To: ashwanthkumar@googlemail.com
Date: Sun, 15 Nov 2015 09:24:57 +0530
Subject: Unable to submit jobs to a Hadoop cluster after a while
To: user@hadoop.apache.org

We're running Hadoop 2.6.0 via CDH 5.4.4, and we get the following error while submitting a new job:

15/10/08 00:33:31 WARN
security.UserGroupInformation: PriviledgedActionException as:hadoop (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /data/hadoopfs/mapred/staging/hadoop/.staging/job_201510050004_0388/job.jar could only be replicated to 0 nodes instead of minReplication (=1). There are 161 datanode(s) running and no node(s) are excluded in this operation.

At that time we had 161 DNs running in the cluster. From the NN logs I see:

2015-10-08 01:00:26,889 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to choose remote rack (location = ~/default-rack), fallback to local rack
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException:
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:691)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRemoteRack(BlockPlacementPolicyDefault.java:580)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:357)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:419)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:214)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:111)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.chooseTargets(BlockManager.java:3746)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.access$200(BlockManager.java:3711)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1400)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1306)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3682)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3634)
        at java.lang.Thread.run(Thread.java:722)
2015-10-08 01:00:26,890 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) [

From one of the live 160+ DN logs, we saw:

Node /default-rack/10.181.8.222:50010 [
  Storage [DISK]DS-2d39f3c3-2e67-48ad-871b-632f66b277d7:NORMAL: 10.181.8.222:50010 is not chosen since the node is too busy (load: 2 > 1.8370786516853932).
]
Node /default-rack/10.181.25.147:50010 [
  Storage [DISK]DS-60b511b0-62aa-4c0f-92d9-6d90ff32ee49:NORMAL: 10.181.25.147:50010 is not chosen since the node is too busy (load: 2 > 1.8370786516853932).
]
Node /default-rack/10.181.8.152:50010 [
  Storage [DISK]DS-7e0bf761-86f2-4748-9eda-fbfd9c69e127:NORMAL: 10.181.8.152:50010 is not chosen since the node is too busy (load: 2 > 1.8370786516853932).
]
Node /default-rack/10.181.25.67:50010 [
  Storage [DISK]DS-5849e4d8-4ab6-4392-aee2-7a354c82c19d:NORMAL: 10.181.25.67:50010 is not chosen since the node is too busy (load: 2 > 1.8370786516853932).
]
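Reading BlockPlacementPolicyDefault, the "too busy" exclusion looks like the considerLoad check: if we read the code right, a node is skipped when its xceiver count exceeds roughly twice the cluster-average load, which would match the "2 > 1.8370786516853932" comparison above. One workaround we're considering (untested on our side, so treat this as a sketch rather than a confirmed fix) is to turn that check off:

    # Sketch: inspect and, if appropriate, disable the load-based placement check.
    # The property name is from Hadoop 2.6's hdfs-default.xml.

    # Print the value currently in effect (defaults to true):
    hdfs getconf -confKey dfs.namenode.replication.considerLoad

    # To disable it, add this to hdfs-site.xml on the NameNode and restart the NN:
    #   <property>
    #     <name>dfs.namenode.replication.considerLoad</name>
    #     <value>false</value>
    #   </property>

We realise this would mask the symptom rather than explain why the average-load figure is so low in the first place, so we'd still like to understand the root cause.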
A few things we observed from our end:
- If we restart the NN, we're able to submit jobs again without any issues.
- We run this Hadoop cluster on AWS.
- The DN and TT processes run on a single EC2 machine, which is backed by an AutoScaling Group.
- We have another cluster which doesn't autoscale and doesn't exhibit this behaviour (see the sketch after this list for how we plan to dig into that).
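Given the last two observations, our working theory (unconfirmed) is that instances terminated by the AutoScaling Group leave the NameNode with a stale view of the cluster, skewing the average-load figure the placement policy compares against, and that restarting the NN resets that view. A quick check we plan to run is to compare what the NN believes is live/dead against the actual fleet (NAMENODE_HOST below is a placeholder; 50070 is the default NN web UI port):

    # Sketch: summary counts of live/dead DNs as the NN sees them.
    hdfs dfsadmin -report | egrep 'Live datanodes|Dead datanodes'

    # Per-node details (e.g. lastContact) from the NN JMX servlet, to spot
    # nodes that the ASG has already removed but the NN still tracks:
    curl -s "http://NAMENODE_HOST:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo"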
Any pointers or ideas on how to solve this for good would be really appreciated.

--

Ashwanth Kumar / ashwanthkumar.in