hadoop-hdfs-dev mailing list archives

From "Jinsong Hu" <jinsong...@hotmail.com>
Subject Re: Fw: hadoop data loss issue discovered. please fix!
Date Fri, 21 May 2010 22:32:36 GMT
Hi, Todd:
  I have cut all the logs for that 10-minute window from the 3 machines; if
you grep for 5636039758999247483, you will see all the matching entries from
the regionserver, namenode, and datanodes, and you can compare the time
sequence. All our machines' clocks are synced via NTP, so you can line the
logs up side by side and reconstruct the exact sequence of events.
  The build I am running is the Yahoo 0.20.9 distribution.

Jinsong

--------------------------------------------------
From: "Todd Lipcon" <todd@cloudera.com>
Sent: Friday, May 21, 2010 2:20 PM
To: <hdfs-dev@hadoop.apache.org>
Subject: Re: Fw: hadoop data loss issue discovered. please fix!

> Hi Jinsong,
>
> Could you upload a tarball of the log files somewhere from each of the DNs
> and the RS involved? It's hard to trace through the log in the email (the
> email added all kinds of wrapping, etc.)
>
> -Todd
>
> On Fri, May 21, 2010 at 2:17 PM, Jinsong Hu <jinsong_hu@hotmail.com> 
> wrote:
>
>> Hi, there:
>>  While using the hadoop 0.20.9-yahoo distribution with hbase 0.20.4, I
>> found that hadoop loses blocks under certain conditions, which corrupts
>> hbase tables.
>>
>>
>>
>> I compared the namenode, datanode, and hbase regionserver logs and figured
>> out the reason.
>>
>>
>>
>> The regionserver 10.110.8.85 asks namenode 10.110.8.83 to save a block;
>> 10.110.8.84 gives multiple IPs, and the regionserver chooses 10.110.8.63
>> and saves the block there. After a while, the namenode asks for the block
>> to be replicated to the 10.110.8.86 and 10.110.8.69 machines. A moment
>> later, .86 and .69 receive the replication, but strangely 10.110.8.59 and
>> 10.110.8.85 also receive a replica of the block, even though they are not
>> in the replication list.
>>
>>
>>
>> Then chooseExcessReplicates asks to delete the excess replicas from .63
>> and .69, thinking there are too many. Even though .63 held the original
>> copy, the algorithm chooses which replica to delete based on the amount
>> of free disk space. A moment later, addToInvalidates (not triggered from
>> chooseExcessReplicates) asks for the block to be deleted on .86, .85, and
>> .59. From checking the code, this can only happen if the block is marked
>> corrupt.
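The excess-replica selection described above can be sketched roughly as follows. This is a simplified illustration of the heuristic (delete from the node with the least free disk space, regardless of which copy is the original), not the actual FSNamesystem code; the `Node` class and its fields are invented for the example.

```java
import java.util.*;

// Simplified sketch of the excess-replica heuristic discussed above:
// among the candidate replicas, invalidate the one on the datanode with
// the least remaining disk space. Illustrative names, not the real API.
public class ExcessReplicaSketch {
    static final class Node {
        final String addr;
        final long remainingBytes; // free space reported by the datanode
        Node(String addr, long remainingBytes) {
            this.addr = addr;
            this.remainingBytes = remainingBytes;
        }
    }

    // Pick the replica to delete: the one on the fullest disk.
    static Node chooseExcessReplica(List<Node> replicas) {
        return Collections.min(replicas,
                Comparator.comparingLong((Node n) -> n.remainingBytes));
    }

    public static void main(String[] args) {
        List<Node> replicas = Arrays.asList(
                new Node("10.110.8.63:50010", 10L << 30),  // original writer
                new Node("10.110.8.69:50010", 500L << 30),
                new Node("10.110.8.86:50010", 200L << 30));
        // The original copy on .63 is deleted first simply because its
        // disk is the fullest -- the age of the replica is not considered.
        System.out.println(chooseExcessReplica(replicas).addr);
        // prints 10.110.8.63:50010
    }
}
```

This is why the original copy on .63 can be the first one scheduled for deletion even though it is the only replica known to be intact.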
>>
>>
>>
>> In the end, the block no longer exists anywhere in the cluster; it is
>> permanently lost.
>>
>>
>>
>>
>>
>> namenode:
>>
>>
>>
>>
>>
>> 2010-05-18 21:21:41,941 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.chooseExcessReplicates: (10.110.8.63:50010, blk_5636039758999247483_31304886) is added to recentInvalidateSets
>>
>> 2010-05-18 21:21:43,995 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.63:50010 to delete blk_5636039758999247483_31304886 blk_4349310048904429157_31304519
>>
>> The block was initially added to 10.110.8.63, then replicated until
>> 10.110.8.63, .59, .69, .86, and .85 all held copies. Subsequently,
>> addToInvalidates removed all of them; code review indicates the replica
>> was considered corrupt, so every copy was deleted.
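If that reading of the code is right, the loss mechanism can be sketched like this: once the namenode judges the block corrupt, every replica holder is put on the invalidate list, so a wrong corruption verdict destroys all copies. This is a hedged sketch under that assumption; the class and method names are illustrative, not the real FSNamesystem code.

```java
import java.util.*;

// Sketch of the failure mode inferred above: addToInvalidates schedules
// the block for deletion on *every* node that holds it, with no surviving
// copy kept back. Illustrative names only.
public class InvalidateAllSketch {
    // node address -> blocks scheduled for deletion on that node
    final Map<String, Set<String>> invalidSets = new HashMap<>();

    void addToInvalidates(String block, Collection<String> replicaNodes) {
        for (String node : replicaNodes) {
            invalidSets.computeIfAbsent(node, k -> new HashSet<>()).add(block);
        }
    }

    public static void main(String[] args) {
        InvalidateAllSketch ns = new InvalidateAllSketch();
        // The five nodes that held blk_5636039758999247483 in the logs.
        List<String> nodes = Arrays.asList(".63", ".59", ".69", ".86", ".85");
        ns.addToInvalidates("blk_5636039758999247483", nodes);
        // Every holder is now scheduled for deletion -- no copy survives.
        System.out.println(ns.invalidSets.size()); // prints 5
    }
}
```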
>>
>>
>>
>>
>>
>> 2010-05-18 21:21:29,913 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.110.8.63:50010 is added to blk_5636039758999247483_31304886 size 441
>>
>> 2010-05-18 21:21:31,987 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.63:50010 to replicate blk_5636039758999247483_31304886 to datanode(s) 10.110.8.86:50010 10.110.8.69:50010
>>
>> 2010-05-18 21:21:41,941 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.chooseExcessReplicates: (10.110.8.63:50010, blk_5636039758999247483_31304886) is added to recentInvalidateSets
>>
>> 2010-05-18 21:21:43,995 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.63:50010 to delete blk_5636039758999247483_31304886 blk_4349310048904429157_31304519
>>
>> 2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_5636039758999247483 is added to invalidSet of 10.110.8.63:50010
>>
>> 2010-05-18 21:26:45,953 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.63:50010 to delete blk_-1838286221287242082_31305179 blk_8446762564182513417_31305143 blk_5636039758999247483_31304886 blk_4628640249731313760_31305046 blk_7460947863067370701_31270225 blk_-4468681536500281247_31270225 blk_8453517711101429609_31303917 blk_9126133835045521966_31303972 blk_4623110280826973929_31305203 blk_-2581238696314957800_31305033 blk_7461125351290749755_31305052
>>
>>
>>
>> 2010-05-18 21:21:31,987 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.63:50010 to replicate blk_5636039758999247483_31304886 to datanode(s) 10.110.8.86:50010 10.110.8.69:50010
>>
>> 2010-05-18 21:21:33,156 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.110.8.69:50010 is added to blk_5636039758999247483_31304886 size 441
>>
>> 2010-05-18 21:21:57,835 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.chooseExcessReplicates: (10.110.8.69:50010, blk_5636039758999247483_31304886) is added to recentInvalidateSets
>>
>> 2010-05-18 21:21:59,005 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.69:50010 to delete blk_5636039758999247483_31304886
>>
>> 2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_5636039758999247483 is added to invalidSet of 10.110.8.69:50010
>>
>> 2010-05-18 21:26:42,951 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.69:50010 to delete blk_-2124965527858346013_31270213 blk_-5027506345849158498_31270213 blk_5636039758999247483_31304886 blk_9148821113904458973_31305189 blk_4850797749721229572_31305072 blk_252039065084461924_31305031 blk_-8351836728009062091_31305208 blk_-7576696059515014894_31305194 blk_-2900250119736465962_31270214 blk_471700613578524871_31304950 blk_-190744003190006044_31305064 blk_7265057386742001625_31305073
>>
>>
>>
>> 2010-05-18 21:21:31,987 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.63:50010 to replicate blk_5636039758999247483_31304886 to datanode(s) 10.110.8.86:50010 10.110.8.69:50010
>>
>> 2010-05-18 21:21:41,941 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.110.8.86:50010 is added to blk_5636039758999247483_31304886 size 441
>>
>> 2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_5636039758999247483 is added to invalidSet of 10.110.8.86:50010
>>
>> 2010-05-18 21:26:42,951 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.86:50010 to delete blk_-6242136662924452584_31259201 blk_5636039758999247483_31304886 blk_4850797749721229572_31305072 blk_252039065084461924_31305031 blk_-1317144678443645904_31305204 blk_6050185755706975664_31270230 blk_2671416971885801868_31304948 blk_-5582352089328547938_31305022 blk_-3115115738671914626_31270210
>>
>>
>>
>> 2010-05-18 21:21:34,413 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.110.8.59:50010 is added to blk_5636039758999247483_31304886 size 441
>>
>> 2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_5636039758999247483 is added to invalidSet of 10.110.8.59:50010
>>
>> 2010-05-18 21:26:39,950 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.59:50010 to delete blk_5636039758999247483_31304886 blk_-4528512156635399625_31305212 blk_1439789418382469336_31305158 blk_8860574934531794641_31270219 blk_-8358193301564392132_31305029
>>
>>
>>
>> 2010-05-18 21:21:57,835 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.110.8.85:50010 is added to blk_5636039758999247483_31304886 size 441
>>
>> 2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_5636039758999247483 is added to invalidSet of 10.110.8.85:50010
>>
>> 2010-05-18 21:26:39,950 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.85:50010 to delete blk_-6242136662924452584_31259201 blk_5636039758999247483_31304886 blk_4628640249731313760_31305046 blk_4747588241975451642_31305123 blk_-6876078628884993825_31270230 blk_-4468681536500281247_31270225 blk_7325830193509411302_31270230 blk_8453517711101429609_31303917 blk_-6094734447689285387_31305127 blk_3353439739797003235_31305037 blk_-5027506345849158498_31270213 blk_1484161645992497144_31270225 blk_4464987648045469454_31305144 blk_7460947863067370701_31270225 blk_-1170815606945644545_31270230 blk_6050185755706975664_31270230 blk_-8358193301564392132_31305029 blk_2671416971885801868_31304948 blk_5593547375459437465_31286511 blk_-2581238696314957800_31305033 blk_4732635559915402193_31270230 blk_-2124965527858346013_31270213 blk_-5837992573431863412_31286612 blk_-432558447034944954_31270208 blk_-3407615138527189735_31305069 blk_8860574934531794641_31270219 blk_233110856487529716_31270229 blk_312750273180273303_31270228 blk_7461125351290749755_31305052 blk_-8902661185532055148_31304947 blk_-8555258258738129670_31270210 blk_252039065084461924_31305031 blk_9037118763503479133_31305120 blk_-8494656323754369174_31305105 blk_9126133835045521966_31303972 blk_-5582352089328547938_31305022 blk_-2900250119736465962_31270214 blk_-3115115738671914626_31270210 blk_7612090442234634555_31270225 blk_5876492007747505188_31270213 blk_471700613578524871_31304950 blk_-190744003190006044_31305064
>>
>>
>> datanode 10.110.8.63:
>>
>>
>>
>> 2010-05-18 21:21:46,058 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Deleting block blk_5636039758999247483_31304886 file /hadoop_data_dir/dfs/data/current/subdir23/blk_5636039758999247483
>>
>>
>>
>> hbase region server 10.110.8.85:
>>
>>
>>
>> DFSClient.java:
>>
>> DatanodeInfo chosenNode = bestNode(nodes, deadNodes);
>> InetSocketAddress targetAddr = NetUtils.createSocketAddr(chosenNode.getName());
>> return new DNAddrPair(chosenNode, targetAddr);
>>
>> This still picked 10.110.8.63, even though the namenode sent the command
>> to delete the block at 21:21:43,995 and the datanode executed it at
>> 21:21:46,058. Why?
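For context, the retry behavior implied by the bestNode(nodes, deadNodes) call above might look roughly like this: the client reads from cached block locations, and only after a read fails is the node added to deadNodes so the next attempt picks a different replica. This is a hypothetical sketch; the real bestNode signature and the surrounding DFSClient retry loop differ.

```java
import java.util.*;

// Sketch of the client-side node selection implied by the DFSClient
// excerpt: pick the first cached location not known to be dead. The
// client has no way to know a replica was just deleted until the read
// fails. Names and signatures here are illustrative.
public class ReadRetrySketch {
    static String bestNode(List<String> nodes, Set<String> deadNodes) {
        for (String n : nodes) {
            if (!deadNodes.contains(n)) return n;
        }
        throw new IllegalStateException("no live replica locations left");
    }

    public static void main(String[] args) {
        // Cached locations for the block, now stale: .63 already deleted it.
        List<String> nodes = Arrays.asList("10.110.8.63:50010", "10.110.8.69:50010");
        Set<String> deadNodes = new HashSet<>();

        String chosen = bestNode(nodes, deadNodes);   // still picks .63
        // The OP_READ_BLOCK to .63 fails (block deleted there);
        // mark it dead and choose again.
        deadNodes.add(chosen);
        System.out.println(bestNode(nodes, deadNodes)); // prints 10.110.8.69:50010
    }
}
```

This would explain why the regionserver still contacted .63 at 21:21:46,188: its cached locations predate the delete, and the failure is only discovered at read time.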
>>
>>
>>
>> 2010-05-18 21:21:46,188 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.110.8.63:50010 for file /hbase/.META./1028785192/info/656097411976846533 for block 5636039758999247483:java.io.IOException: Got error in response to OP_READ_BLOCK for file /hbase/.META./1028785192/info/656097411976846533 for block 5636039758999247483
>>
>
>
> -- 
> Todd Lipcon
> Software Engineer, Cloudera
> 