hadoop-hdfs-issues mailing list archives

From "Zephyr Guo (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
Date Fri, 09 Mar 2018 13:43:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392889#comment-16392889
] 

Zephyr Guo edited comment on HDFS-13243 at 3/9/18 1:42 PM:
-----------------------------------------------------------

[~jojochuang]
{quote}
I suspect this race condition happens because of this unusual setting (or the setting makes
it more prone to this bug).
{quote}
The minimal replication is set to 1 in my test case. I agree that this unusual setting makes
it more prone to this bug.

{quote}
If the problem is client side race condition, I would recommend fixing it at client side.
{quote}
We have to fix the server side as well. We cannot make every user update their client
code, right?
I will write a new patch in a few days. Thanks for your advice.

BTW, why wasn't dfsClient.namenode.fsync() placed inside the synchronized code block in the
first place? Was that for performance? If so, is it still necessary to fix the client?
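
To make the question concrete, here is a minimal single-threaded sketch of the ordering problem that can arise when a length is captured under a lock but the RPC carrying it is sent after the lock is released. ClientSyncOrderSketch and all of its members are hypothetical and are not the real DFSOutputStream; the "interleaved" Runnable stands in for work done by a second thread, and the byte counts are borrowed from the log in the issue description. Keeping a blocking RPC outside a lock is a common way to avoid stalling other callers, but this thread does not confirm that this was the reason here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a client stream; NOT the real DFSOutputStream. */
public class ClientSyncOrderSketch {
    private final Object lock = new Object();
    private long bytesAcked;
    private boolean closed;

    /** Calls in the order the (imaginary) NameNode would receive them. */
    final List<String> rpcLog = new ArrayList<>();

    void write(long n) {
        synchronized (lock) { bytesAcked += n; }
    }

    /** Length is captured under the lock, but the RPC is sent after releasing it. */
    void syncRpcOutsideLock(Runnable interleaved) {
        long captured;
        synchronized (lock) { captured = bytesAcked; }
        interleaved.run();                        // a second thread's work lands in this gap
        rpcLog.add("fsync(" + captured + ")");    // stale length is sent late
    }

    /** Variant: the RPC is sent while the lock is still held. */
    void syncRpcInsideLock(Runnable interleaved) {
        synchronized (lock) {
            rpcLog.add("fsync(" + bytesAcked + ")");
        }
        interleaved.run();                        // the second thread proceeds only afterwards
    }

    void close() {
        synchronized (lock) {
            if (closed) return;
            closed = true;
            rpcLog.add("complete(" + bytesAcked + ")");
        }
    }

    /** Replays one sync racing with one write-then-close. */
    static List<String> demo(boolean rpcInsideLock) {
        ClientSyncOrderSketch s = new ClientSyncOrderSketch();
        s.write(141232L);
        Runnable other = () -> { s.write(1913181L); s.close(); };
        if (rpcInsideLock) s.syncRpcInsideLock(other); else s.syncRpcOutsideLock(other);
        return s.rpcLog;
    }

    public static void main(String[] args) {
        System.out.println("RPC outside lock: " + demo(false));
        System.out.println("RPC inside lock:  " + demo(true));
    }
}
```

In this model, sending the RPC outside the lock lets complete(...) reach the server before an fsync(...) that carries an older, shorter length. Even if the client were changed, older clients could still produce that order, which is the argument for a server-side fix as well.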




was (Author: gzh1992n):
[~jojochuang]
{quote}
I suspect this race condition happens because of this unusual setting (or the setting makes
it more prone to this bug).
{quote}
The minimal replication is 1 in my test case. I agree that this unusual setting makes it more
prone to this bug.

{quote}
If the problem is client side race condition, I would recommend fixing it at client side.
{quote}
We have to fix the server side as well. We cannot make every user update their client
code, right?
I will write a new patch in a few days. Thanks for your advice.

BTW, why wasn't dfsClient.namenode.fsync() placed inside the synchronized code block in the
first place? Was that for performance? If so, is it still necessary to fix the client?



> Get CorruptBlock because of calling close and sync in same time
> ---------------------------------------------------------------
>
>                 Key: HDFS-13243
>                 URL: https://issues.apache.org/jira/browse/HDFS-13243
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.7.2, 3.2.0
>            Reporter: Zephyr Guo
>            Assignee: Zephyr Guo
>            Priority: Critical
>             Fix For: 3.2.0
>
>         Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch
>
>
> An HDFS file might become broken because of corrupt block(s) that can be produced by calling
close and sync at the same time.
> When a close call does not succeed, the UCBlock state still changes to COMMITTED. If a sync
request is then popped from the queue and processed, the sync operation changes the length of
the last block.
> After that, the DataNode reports all received blocks to the NameNode, which checks the
length of every COMMITTED block. Because the length recorded in NameNode memory already differs
from the length reported by the DataNode, the last block is marked as corrupt due to the
inconsistent length.
>  
> {panel:title=Log in my hdfs}
> 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION,
truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
for /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* fsync: /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
for DFSClient_NONMAPREDUCE_1077513762_1
> 2018-03-05 04:05:39,761 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK*
blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in file /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated:
10.0.0.220:50010 is added to blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null,
primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
size 2054413
> 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap:
blk_1085498930 added as corrupt on 10.0.0.219:50010 by hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219
because block is COMMITTED and reported length 2054413 does not match length in block map
141232
> 2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap:
blk_1085498930 added as corrupt on 10.0.0.218:50010 by hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218
because block is COMMITTED and reported length 2054413 does not match length in block map
141232
> 2018-03-05 04:05:40,162 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK*
blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW],
ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]}
is not COMPLETE (ucState = COMMITTED, replication# = 3 >= minimum = 2) in file /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515
> {panel}
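
The sequence in the description can be replayed with a small toy model. This is only a sketch: CommitSyncRaceSketch, BlockRecord and the methods below are hypothetical names and are not HDFS classes, and the "guarded" variant is one illustrative server-side check, not the content of the attached patches. The two lengths are taken from the log above.

```java
/** Toy model of the close/sync interleaving; none of these types are HDFS classes. */
public class CommitSyncRaceSketch {
    enum UcState { UNDER_CONSTRUCTION, COMMITTED }

    /** Hypothetical stand-in for the NameNode's in-memory record of the last block. */
    static final class BlockRecord {
        UcState state = UcState.UNDER_CONSTRUCTION;
        long length;
    }

    /** close(): the block is committed with the final length sent by the client. */
    static void commit(BlockRecord b, long finalLength) {
        b.state = UcState.COMMITTED;
        b.length = finalLength;
    }

    /** Unguarded fsync: overwrites the recorded length regardless of block state. */
    static void fsyncUnguarded(BlockRecord b, long syncedLength) {
        b.length = syncedLength;
    }

    /** Guarded fsync (an assumption for illustration): ignore updates once committed. */
    static void fsyncGuarded(BlockRecord b, long syncedLength) {
        if (b.state == UcState.UNDER_CONSTRUCTION) {
            b.length = syncedLength;
        }
    }

    /** Block-report check: a COMMITTED block whose lengths differ is flagged corrupt. */
    static boolean reportedAsCorrupt(BlockRecord b, long reportedLength) {
        return b.state == UcState.COMMITTED && b.length != reportedLength;
    }

    /** Replays commit, then a late fsync with a stale length, then the block report. */
    static boolean run(boolean guarded) {
        BlockRecord b = new BlockRecord();
        commit(b, 2054413L);                 // close commits the real length
        if (guarded) {
            fsyncGuarded(b, 141232L);        // late sync is ignored
        } else {
            fsyncUnguarded(b, 141232L);      // late sync rewrites the length
        }
        return reportedAsCorrupt(b, 2054413L);
    }

    public static void main(String[] args) {
        System.out.println("unguarded, flagged corrupt: " + run(false));
        System.out.println("guarded, flagged corrupt:   " + run(true));
    }
}
```

In this model the unguarded path flags a healthy replica as corrupt because the recorded length was rewritten after the commit, while the guarded path leaves the committed length intact.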



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
