Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id E16A3200BC0 for ; Tue, 15 Nov 2016 11:54:55 +0100 (CET) Received: by cust-asf.ponee.io (Postfix) id DE463160B03; Tue, 15 Nov 2016 10:54:55 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 0CE06160B02 for ; Tue, 15 Nov 2016 11:54:54 +0100 (CET) Received: (qmail 76714 invoked by uid 500); 15 Nov 2016 10:54:53 -0000 Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list user@hadoop.apache.org Received: (qmail 76703 invoked by uid 99); 15 Nov 2016 10:54:52 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 15 Nov 2016 10:54:52 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id 16FCFC18B0 for ; Tue, 15 Nov 2016 10:54:52 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 1.93 X-Spam-Level: * X-Spam-Status: No, score=1.93 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_ENVFROM_END_DIGIT=0.25, HTML_MESSAGE=2, RCVD_IN_DNSWL_LOW=-0.7, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, RCVD_IN_SORBS_SPAM=0.5, SPF_PASS=-0.001, WEIRD_PORT=0.001] autolearn=disabled Authentication-Results: spamd4-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP 
id g4omdi-gBNNC for ; Tue, 15 Nov 2016 10:54:50 +0000 (UTC) Received: from mail-it0-f49.google.com (mail-it0-f49.google.com [209.85.214.49]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTPS id A91FA5F24C for ; Tue, 15 Nov 2016 10:54:49 +0000 (UTC) Received: by mail-it0-f49.google.com with SMTP id q124so185447640itd.1 for ; Tue, 15 Nov 2016 02:54:49 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:from:date:message-id:subject:to; bh=JqZ+Dk8WV77x+S7PTguIt2LbxisMhudIz1/o4dmE7VE=; b=QCiU9+C5SHg1yo2N9bGBBZAH7ab0wHQhnvPEtvBNAkg1TycQsyCMwpkH4efsmLsG2z 0b99Cn9uN50uqaKEIvMah40n2ceQXCcgk/NOQtsB2wW/0POKjh8B+yLSO8AURrprcS+t kxQl5CsJP9txP8Ms8fWSiS/p79q+R/OA6w29mPdK+Fmtt8q+Cq28IM3OxoJV3jOMcarR NwOzk7lIyksP1DrtXFYtmeYz+1Yd8smxwa/Yag/JbfOn33waSzEfFpWEaK2Ciq0+L5l1 Q82PIWGl+XUGEFB3bE4IxRtgPOC696cU5n/4YlG3bG7cXD8GidXrwm/TygRzXUM+3qNG CsGQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:from:date:message-id:subject:to; bh=JqZ+Dk8WV77x+S7PTguIt2LbxisMhudIz1/o4dmE7VE=; b=Xk7nF6zvOLSr1sH738XMBrEYRUos+BhkXIyKkWJjXlKuIorhyz4qQ4azsBO2F9uaI9 4YEDD2CTZXXzXopnV0BMfHcrcyo60+uB4P8O3VMC8uvTUiC4qtDIEj1GzbuqJdADUPaL AXzK3dotv/w7u6I6vAtgD9XN9AdlilhPln6U6bJDH6uV3EYbSgO7BJ7y6q27keh4Uxct P7nVT7NHTavzWJs8MnxlkiG2H2KhlAHYlfelzwbYY83u+ANRZEPJ1mWD/SleG9H/MTi7 QDs+yXW9BLVG2OzN+zArp3M97xbBUdqiojoTE93SDX4jKr0WU05lgr+4YM+w2qJD/cVy muSw== X-Gm-Message-State: ABUngvdfktM8HybmeDWThw8PUIkgtqnWAzKxbGfZyJWR66pXEYN4aWN24uCppREoWSmS0ZxygjAO0wm4oOspyA== X-Received: by 10.107.29.199 with SMTP id d190mr29030348iod.82.1479207285840; Tue, 15 Nov 2016 02:54:45 -0800 (PST) MIME-Version: 1.0 From: Hariharan Date: Tue, 15 Nov 2016 10:54:35 +0000 Message-ID: Subject: HDFS - Corrupt replicas preventing decommissioning? 
To: user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=001a113fe68cd2622f054154c869
archived-at: Tue, 15 Nov 2016 10:54:56 -0000

--001a113fe68cd2622f054154c869
Content-Type: text/plain; charset=UTF-8

Hello folks,

I'm running Apache Hadoop 2.6.0 and I keep seeing corrupt replicas. Example:

2016-11-15 06:42:38,104 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block: *blk_1073747320_231160*{blockUCState=COMMITTED, primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[[DISK]DS-11d5d492-a608-4bc0-9a04-048b8127bb32:NORMAL:10.0.8.185:50010|RBW]]}, Expected Replicas: 2, *live replicas: 0, corrupt replicas: 2*, decommissioned replicas: 1, excess replicas: 0, Is Open File: true, Datanodes having this block: 10.0.8.185:50010 10.0.8.148:50010 10.0.8.149:50010 , Current Datanode: 10.0.8.185:50010, Is current datanode decommissioning: true

But I can't figure out which file this block belongs to: *hadoop fsck / -files -blocks -locations | grep blk_1073747320_231160* returns nothing.

So I'm unable to delete the file, and my concern is that this seems to be blocking decommissioning of my datanode (going on for ~18 hours now). Looking at the code in BlockManager.java, the DN is not marked as decommissioned while it holds blocks with no live replicas.

My questions are:

1. What causes corrupt replicas and how can I avoid them? I seem to be seeing these frequently (examples from prior runs):

hadoop-hdfs-namenode-ip-10-0-8-199.log.9:2016-11-13 23:54:57,513 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block: blk_1074063633_2846521{blockUCState=COMMITTED, primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[[DISK]DS-7b8e7b76-6066-43fb-8340-d93f7ab9c6ea:NORMAL:10.0.8.75:50010|RBW]]}, Expected Replicas: 2, *live replicas: 0*, *corrupt replicas: 4*, decommissioned replicas: 1, excess replicas: 0, Is Open File: true, Datanodes having this block: 10.0.8.75:50010 10.0.8.156:50010 10.0.8.188:50010 10.0.8.34:50010 10.0.8.74:50010 , Current Datanode: 10.0.8.75:50010, Is current datanode decommissioning: true

hadoop-hdfs-namenode-ip-10-0-8-199.log.9:2016-11-13 23:54:57,513 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block: blk_1073975974_2185091{blockUCState=COMMITTED, primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[[DISK]DS-b9b8b191-f8c8-49b0-b4c1-b2a9ce6b9ee8:NORMAL:10.0.8.153:50010|RBW]]}, Expected Replicas: 2, *live replicas: 0, corrupt replicas: 3*, decommissioned replicas: 1, excess replicas: 0, Is Open File: true, Datanodes having this block: 10.0.8.153:50010 10.0.8.74:50010 10.0.8.7:50010 10.0.8.198:50010 , Current Datanode: 10.0.8.153:50010, Is current datanode decommissioning: true

hadoop-hdfs-namenode-ip-10-0-8-199.log.9:2016-11-13 23:54:57,513 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block: blk_1073975974_2185091{blockUCState=COMMITTED, primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[[DISK]DS-b9b8b191-f8c8-49b0-b4c1-b2a9ce6b9ee8:NORMAL:10.0.8.153:50010|RBW]]}, Expected Replicas: 2, *live replicas: 0, corrupt replicas: 3*, decommissioned replicas: 1, excess replicas: 0, Is Open File: true, Datanodes having this block: 10.0.8.153:50010 10.0.8.74:50010 10.0.8.7:50010 10.0.8.198:50010 , Current Datanode: 10.0.8.7:50010, Is current datanode decommissioning: true

2. Is this possibly covered by a JIRA that's fixed in recent versions (I realize I'm running a very old version)?

3. Is there anything I can do to "force" decommissioning of such nodes (apart from forcefully terminating them)?

Thanks,
Hari

--001a113fe68cd2622f054154c869
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
--001a113fe68cd2622f054154c869--
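
The blocks with no live replicas described in the message above can be collected from a NameNode log with a short script. What follows is only a minimal sketch in Python: the regular expressions are inferred from the log lines quoted in this message and may not match other Hadoop versions, and the names `parse_block_line`, `zero_live_blocks` and `SAMPLE` are hypothetical helpers rather than anything shipped with Hadoop.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Optional

# Patterns derived only from the log lines quoted in this message (assumption:
# other Hadoop versions may format this line differently).
BLOCK_ID = re.compile(r"Block: \*?(blk_\d+_\d+)")
REPLICA_COUNT = re.compile(r"(live|corrupt|decommissioned|excess) replicas: (\d+)")
HOLDERS = re.compile(r"Datanodes having this block: (.*?), Current Datanode: ([^,\s]+)")


def parse_block_line(line: str) -> Optional[dict]:
    """Return the fields of one BlockManager 'Block: ...' line, or None."""
    block = BLOCK_ID.search(line)
    holders = HOLDERS.search(line)
    if block is None or holders is None:
        return None
    counts = {kind: int(n) for kind, n in REPLICA_COUNT.findall(line)}
    return {
        "block": block.group(1),
        "live": counts.get("live"),
        "corrupt": counts.get("corrupt"),
        "holders": holders.group(1).split(),
        "current": holders.group(2),
    }


def zero_live_blocks(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each block with no live replicas to the datanodes reported as holding it."""
    result: Dict[str, set] = defaultdict(set)
    for line in lines:
        rec = parse_block_line(line)
        if rec is not None and rec["live"] == 0:
            # The same block can be logged once per decommissioning datanode,
            # so holders are merged into a set.
            result[rec["block"]].update(rec["holders"])
    return {block: sorted(nodes) for block, nodes in result.items()}


# First example line from the message, with the emphasis asterisks removed.
SAMPLE = (
    "2016-11-15 06:42:38,104 INFO "
    "org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: "
    "Block: blk_1073747320_231160{blockUCState=COMMITTED, primaryNodeIndex=0, "
    "replicas=[ReplicaUnderConstruction[[DISK]DS-11d5d492-a608-4bc0-9a04-"
    "048b8127bb32:NORMAL:10.0.8.185:50010|RBW]]}, Expected Replicas: 2, "
    "live replicas: 0, corrupt replicas: 2, decommissioned replicas: 1, "
    "excess replicas: 0, Is Open File: true, Datanodes having this block: "
    "10.0.8.185:50010 10.0.8.148:50010 10.0.8.149:50010 , "
    "Current Datanode: 10.0.8.185:50010, Is current datanode decommissioning: true"
)

print(zero_live_blocks([SAMPLE]))
```

To run it against a real log, pass an open file object to `zero_live_blocks`; lines that do not look like the quoted format are skipped. The result only lists candidate blocks and says nothing about why the replicas were marked corrupt.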