Date: Thu, 27 Aug 2015 07:30:46 +0000 (UTC)
From: "Walter Su (JIRA)"
To: hdfs-issues@hadoop.apache.org
Reply-To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Updated] (HDFS-8763) After file closed, a race condition between IBR of 3rd replica of lastBlock and ReplicationMonitor

     [ https://issues.apache.org/jira/browse/HDFS-8763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Walter Su updated HDFS-8763:
----------------------------
    Description: 
-For our cluster the NameNode is always very busy, so the lock contention of every incremental block report is heavy.-
-The logic of an incremental block report is as follows: the client sends the block to dn1, dn1 mirrors it to dn2, and dn2 mirrors it to dn3. After finishing the block, every datanode reports the newly received block to the NameNode. On the NameNode side, all reports go through the method processIncrementalBlockReport in the BlockManager class. But the status of the block reported from dn2 and dn3 is RECEIVING_BLOCK, while for dn1 it is RECEIVED_BLOCK. It is fine if dn2 and dn3 report before dn1 (the common case), but in a busy environment it is easy for dn1 to report before dn2 or dn3. Let's assume dn2 reports first, dn1 second, and dn3 third.-
-So dn1's addStoredBlock will find that the replica count of this block has not reached the expected number (which is 3), the block will be added to the neededReplications queue, and soon some node in the pipeline (dn1 or dn2) will be asked to replicate it to dn4. After some time dn4 and dn3 both report this block, and then one node is chosen to be invalidated.-
Here is one log I found in our cluster:
{noformat}
2015-07-08 01:05:34,675 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /logs/***_bigdata_spam/logs/application_1435099124107_470749/xx.xx.4.62_45454.tmp. BP-1386326728-xx.xx.2.131-1382089338395 blk_3194502674_2121080184{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-a7c0f8f6-2399-4980-9479-efa08487b7b3:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-c75145a0-ed63-4180-87ee-d48ccaa647c5:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-15a4dc8e-5b7d-449f-a941-6dced45e6f07:NORMAL|RBW]]}
2015-07-08 01:05:34,689 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: xx.xx.7.75:50010 is added to blk_3194502674_2121080184{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-15a4dc8e-5b7d-449f-a941-6dced45e6f07:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-74ed264f-da43-4cc3-9fa9-164ba99f752a:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-56121ce1-8991-45b3-95bc-2a5357991512:NORMAL|RBW]]} size 0
2015-07-08 01:05:34,689 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: xx.xx.4.62:50010 is added to blk_3194502674_2121080184{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-15a4dc8e-5b7d-449f-a941-6dced45e6f07:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-74ed264f-da43-4cc3-9fa9-164ba99f752a:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-56121ce1-8991-45b3-95bc-2a5357991512:NORMAL|RBW]]} size 0
2015-07-08 01:05:35,003 INFO BlockStateChange: BLOCK* ask xx.xx.4.62:50010 to replicate blk_3194502674_2121080184 to datanode(s) xx.xx.4.65:50010
2015-07-08 01:05:35,403 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: xx.xx.7.73:50010 is added to blk_3194502674_2121080184 size 67750
2015-07-08 01:05:35,833 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: xx.xx.4.65:50010 is added to blk_3194502674_2121080184 size 67750
2015-07-08 01:05:35,833 INFO BlockStateChange: BLOCK* InvalidateBlocks: add blk_3194502674_2121080184 to xx.xx.7.75:50010
2015-07-08 01:05:35,833 INFO BlockStateChange: BLOCK* chooseExcessReplicates: (xx.xx.7.75:50010, blk_3194502674_2121080184) is added to invalidated blocks set
2015-07-08 01:05:35,852 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* InvalidateBlocks: ask xx.xx.7.75:50010 to delete [blk_3194502674_2121080184, blk_3194497594_2121075104]
{noformat}
On some days the number of occurrences of this situation can reach 400,000, which is bad for performance and wastes network bandwidth.
Our base version is Hadoop 2.4; I have also checked Hadoop 2.7.1 and didn't find any difference.

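The race described above can be sketched in miniature. The following is a hypothetical, heavily simplified Java model (the class, fields, and method names are invented for illustration and are not the actual BlockManager code): the NameNode only counts finalized (RECEIVED_BLOCK) replicas, so when dn1's RECEIVED_BLOCK IBR arrives while dn2 and dn3 are still in RECEIVING state, the block briefly looks under-replicated, the monitor schedules an extra copy to dn4, and once all replicas have reported one of them is invalidated as excess.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Toy model of the IBR/ReplicationMonitor race (names are hypothetical).
class IbrRaceSketch {
    static final int EXPECTED_REPLICATION = 3;

    final Set<String> finalizedReplicas = new LinkedHashSet<>();
    final Set<String> neededReplications = new LinkedHashSet<>();
    final Set<String> excessReplicas = new LinkedHashSet<>();

    // Models handling of a RECEIVED_BLOCK incremental block report:
    // only finalized replicas count toward the replication target.
    void reportReceived(String dn) {
        finalizedReplicas.add(dn);
        if (finalizedReplicas.size() < EXPECTED_REPLICATION) {
            neededReplications.add("blk_x");     // block looks under-replicated
        } else {
            neededReplications.remove("blk_x");  // target reached, stop tracking
        }
    }

    // Models the ReplicationMonitor draining the queue: if the block is still
    // queued when the monitor runs, an extra replication is scheduled.
    String replicationMonitorTick(String targetDn) {
        return neededReplications.remove("blk_x") ? targetDn : null;
    }

    // Models chooseExcessReplicates: with more replicas than needed,
    // one is picked and moved to the invalidation set.
    void invalidateExcess() {
        while (finalizedReplicas.size() > EXPECTED_REPLICATION) {
            String victim = finalizedReplicas.iterator().next();
            finalizedReplicas.remove(victim);
            excessReplicas.add(victim);
        }
    }

    public static void main(String[] args) {
        IbrRaceSketch nn = new IbrRaceSketch();
        nn.reportReceived("dn2");                        // dn2 finishes first
        nn.reportReceived("dn1");                        // dn1 second: only 2 of 3 finalized
        String extra = nn.replicationMonitorTick("dn4"); // monitor fires before dn3 reports
        nn.reportReceived("dn3");                        // dn3 finally reports
        nn.reportReceived(extra);                        // dn4's scheduled copy completes
        nn.invalidateExcess();                           // now 4 replicas: one is invalidated
        System.out.println("replicas=" + nn.finalizedReplicas.size()
            + " invalidated=" + nn.excessReplicas.size());
    }
}
```

Under this model, if dn3's report had arrived before the monitor tick, the block would have been removed from the queue in time and no extra copy (and no later invalidation) would occur, which matches the "common case" ordering described in the report.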
> After file closed, a race condition between IBR of 3rd replica of lastBlock and ReplicationMonitor
> --------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-8763
>                 URL: https://issues.apache.org/jira/browse/HDFS-8763
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: HDFS
>    Affects Versions: 2.4.0
>            Reporter: jiangyu
>            Assignee: Walter Su
>            Priority: Minor
>         Attachments: HDFS-8763.01.patch
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)