hadoop-hdfs-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jinsong Hu" <jinsong...@hotmail.com>
Subject Fw: hadoop data loss issue discovered. please fix!
Date Fri, 21 May 2010 21:17:31 GMT
Hi, There:
  While I used hadoop 0.20.9-yahoo distribution and hbase 0.20.4 version, I
found that the hadoop lose blocks under certain
situation, and thus corrupt hbase tables.



I compared namenode, datanode and hbase regionserver and figured out the
reason.



The regionserver 10.110.8.85 asks namenode 10.110.8.83 to save a block,
10.110.8.84 gives

Multiple IP,  regionserver choose 10.110.8.63 and save the block there.
After a while,  namenode

Asks the bock to be replicated to  10.110.8..86 and 10.110.8..69   machines.
A moment late, .86. .69 received

The replication, but strangely,  10.110.8..59  10.110.8..85 also received
replicaton of the block., even though it

Is not in the replication list.



Then the chooseExcessReplicates  asks to delele excess from .63, .69 ,
thinking there are too

Many replica.  Even though .63 was the original copy, the algorithm choose
to delete block based

On the amount of empty disk.  A moment later addToInvalidates ( not from
chooseExcessReplicates  ) asks

The block to be deleted on .86, .85, .59.  I check the code, this can only
be possible if

The block is corrupted.



In the end, this block doesn't exist anywhere in the cluster. And it is
permanently lost.





namenode:





2010-05-18 21:21:41,941 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSyst

em.chooseExcessReplicates: (10.110.8.63:50010,
blk_5636039758999247483_31304886)

 is added to recentInvalidateSets

2010-05-18 21:21:43,995 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask
10.1

10.8.63:50010 to delete  blk_5636039758999247483_31304886
blk_434931004890442915

7_31304519







the block was initially added to 10.110.8.63, then replicated to

10.110.8.63 59 69 86 85   . subsequently, replication process
addToInvalidates

removed all of them. the code review shows that the replicate is corrupt,

and all get deleted.





2010-05-18 21:21:29,913 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSyst

em.addStoredBlock: blockMap updated: 10.110.8.63:50010 is added to
blk_563603975

8999247483_31304886 size 441

2010-05-18 21:21:31,987 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask
10.1

10.8.63:50010 to replicate blk_5636039758999247483_31304886 to datanode(s)
10.11

0.8.86:50010 10.110.8.69:50010

2010-05-18 21:21:41,941 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSyst

em.chooseExcessReplicates: (10.110.8.63:50010,
blk_5636039758999247483_31304886)

 is added to recentInvalidateSets

2010-05-18 21:21:43,995 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask
10.1

10.8.63:50010 to delete  blk_5636039758999247483_31304886
blk_434931004890442915

7_31304519

2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSyst

em.addToInvalidates: blk_5636039758999247483 is added to invalidSet of
10.110.8.

63:50010

2010-05-18 21:26:45,953 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask
10.1

10.8.63:50010 to delete  blk_-1838286221287242082_31305179
blk_84467625641825134

17_31305143 blk_5636039758999247483_31304886
blk_4628640249731313760_31305046 bl

k_7460947863067370701_31270225 blk_-4468681536500281247_31270225
blk_84535177111

01429609_31303917 blk_9126133835045521966_31303972
blk_4623110280826973929_31305

203 blk_-2581238696314957800_31305033 blk_7461125351290749755_31305052









2010-05-18 21:21:31,987 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask
10.1

10.8.63:50010 to replicate blk_5636039758999247483_31304886 to datanode(s)
10.11

0.8.86:50010 10.110.8.69:50010

2010-05-18 21:21:33,156 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSyst

em.addStoredBlock: blockMap updated: 10.110.8.69:50010 is added to
blk_563603975

8999247483_31304886 size 441

2010-05-18 21:21:57,835 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSyst

em.chooseExcessReplicates: (10.110.8.69:50010,
blk_5636039758999247483_31304886)

 is added to recentInvalidateSets

2010-05-18 21:21:59,005 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask
10.1

10.8.69:50010 to delete  blk_5636039758999247483_31304886

2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSyst

em.addToInvalidates: blk_5636039758999247483 is added to invalidSet of
10.110.8.

69:50010

2010-05-18 21:26:42,951 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask
10.1

10.8.69:50010 to delete  blk_-2124965527858346013_31270213
blk_-5027506345849158

498_31270213 blk_5636039758999247483_31304886
blk_9148821113904458973_31305189 b

lk_4850797749721229572_31305072 blk_252039065084461924_31305031
blk_-83518367280

09062091_31305208 blk_-7576696059515014894_31305194
blk_-2900250119736465962_312

70214 blk_471700613578524871_31304950 blk_-190744003190006044_31305064
blk_72650

57386742001625_31305073











2010-05-18 21:21:31,987 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask
10.

10.8.63:50010 to replicate blk_5636039758999247483_31304886 to datanode(s)
10.1

0.8.86:50010 10.110.8.69:50010

2010-05-18 21:21:41,941 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSys

em.addStoredBlock: blockMap updated: 10.110.8.86:50010 is added to
blk_56360397

8999247483_31304886 size 441

2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSys

em.addToInvalidates: blk_5636039758999247483 is added to invalidSet of
10.110.8

86:50010

2010-05-18 21:26:42,951 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask
10.

10.8.86:50010 to delete  blk_-6242136662924452584_31259201
blk_5636039758999247

83_31304886 blk_4850797749721229572_31305072 blk_252039065084461924_31305031
bl

_-1317144678443645904_31305204 blk_6050185755706975664_31270230
blk_26714169718

5801868_31304948 blk_-5582352089328547938_31305022
blk_-3115115738671914626_312

0210









2010-05-18 21:21:34,413 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSyst

em.addStoredBlock: blockMap updated: 10.110.8.59:50010 is added to
blk_563603975

8999247483_31304886 size 441

2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSyst

em.addToInvalidates: blk_5636039758999247483 is added to invalidSet of
10.110.8.

59:50010

2010-05-18 21:26:39,950 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask
10.1

10.8.59:50010 to delete  blk_5636039758999247483_31304886
blk_-45285121566353996

25_31305212 blk_1439789418382469336_31305158
blk_8860574934531794641_31270219 bl

k_-8358193301564392132_31305029







2010-05-18 21:21:57,835 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSyst

em.addStoredBlock: blockMap updated: 10.110.8.85:50010 is added to
blk_563603975

8999247483_31304886 size 441

2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSyst

em.addToInvalidates: blk_5636039758999247483 is added to invalidSet of
10.110.8.

85:50010

2010-05-18 21:26:39,950 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask
10.1

10.8.85:50010 to delete  blk_-6242136662924452584_31259201
blk_56360397589992474

83_31304886 blk_4628640249731313760_31305046
blk_4747588241975451642_31305123 bl

k_-6876078628884993825_31270230 blk_-4468681536500281247_31270225
blk_7325830193

509411302_31270230 blk_8453517711101429609_31303917
blk_-6094734447689285387_313

05127 blk_3353439739797003235_31305037 blk_-5027506345849158498_31270213
blk_148

4161645992497144_31270225 blk_4464987648045469454_31305144
blk_74609478630673707

01_31270225 blk_-1170815606945644545_31270230
blk_6050185755706975664_31270230 b

lk_-8358193301564392132_31305029 blk_2671416971885801868_31304948
blk_5593547375

459437465_31286511 blk_-2581238696314957800_31305033
blk_4732635559915402193_312

70230 blk_-2124965527858346013_31270213 blk_-5837992573431863412_31286612
blk_-4

32558447034944954_31270208 blk_-3407615138527189735_31305069
blk_886057493453179

4641_31270219 blk_233110856487529716_31270229
blk_312750273180273303_31270228 bl

k_7461125351290749755_31305052 blk_-8902661185532055148_31304947
blk_-8555258258

738129670_31270210 blk_252039065084461924_31305031
blk_9037118763503479133_31305

120 blk_-8494656323754369174_31305105 blk_9126133835045521966_31303972
blk_-5582

352089328547938_31305022 blk_-2900250119736465962_31270214
blk_-3115115738671914

626_31270210 blk_7612090442234634555_31270225
blk_5876492007747505188_31270213 b

lk_471700613578524871_31304950 blk_-190744003190006044_31305064























datanode 10.110.8.63:



2010-05-18 21:21:46,058 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: De

leting block blk_5636039758999247483_31304886 file
/hadoop_data_dir/dfs/data/cur

rent/subdir23/blk_5636039758999247483









hbase region server 10.110.8.85:



DFSClient.java,

DatanodeInfo chosenNode = bestNode(nodes, deadNodes);

InetSocketAddress targetAddr =
NetUtils.createSocketAddr(chosenNode.getName());

return new DNAddrPair(chosenNode, targetAddr);



still picked the 10.110.8.63 even though the command is sent from name node
21:21:43,995

to delete the block, and it was executed at 21:21:46,058 . ?









2010-05-18 21:21:46,188 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
connect

 to /10.110.8.63:50010 for file
/hbase/.META./1028785192/info/656097411976846533

 for block 5636039758999247483:java.io.IOException: Got error in response to
OP_

READ_BLOCK for file /hbase/.META./1028785192/info/656097411976846533 for
block 5

636039758999247483






 

Mime
View raw message