hadoop-common-dev mailing list archives

From "Sanjay Radia (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4810) Data lost at cluster startup time
Date Fri, 12 Dec 2008 22:58:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12656191#action_12656191 ]

Sanjay Radia commented on HADOOP-4810:
--------------------------------------

Looks good. 4 things:

1) Change the comment
       // Delete new replica.
   to
       // mark new replica as corrupt
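
For context, this is roughly the spot in FSNamesystem.addStoredBlock I mean (paraphrased
from memory of the 0.18 code, so the surrounding names may not match the patch exactly):

       } else if (cursize > block.getNumBytes()) {
         // mark new replica as corrupt: keep the replica on disk and
         // track it in the corrupt-replicas map instead of scheduling
         // a delete, so no data is dropped while in safe mode
         markBlockAsCorrupt(block, node);
       }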

2) For each of the cases, check to see if the lease is open. If the lease is not open, log an
error that we got a length mismatch even though the file was not open.
  Also file a jira for the case when the lease is not open, to perhaps write the new length to
the edits log (I am not sure if writing the new length is right or wrong, but we can debate
this on that jira).
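
Something like the sketch below; I have not checked the exact 0.18 API, so treat the
under-construction test as a placeholder for "lease is open":

       INodeFile file = blocksMap.getINode(block);
       // an open file is an INodeFileUnderConstruction (placeholder
       // check -- the real test may need to go through leaseManager);
       // a closed file should never report a new length
       if (file != null && !file.isUnderConstruction()) {
         LOG.error("Inconsistent size for block " + block
                   + " reported from " + node.getName()
                   + " but the file is not under construction");
       }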

3) Your fix will not let us distinguish between true corruption caused by some bug in HDFS
and the normal mismatch that can occur during appends when a client dies (I am not sure of
this, but that is my recollection from the append discussions with Dhruba last year at Yahoo).
This is okay for now, but let us file a jira to fix this so that we can distinguish the two.
The easy code fix is to add a field to the internal data structure to record the original
length in the fsimage - but this will increase the memory usage of the system, since the
4 bytes will be multiplied by the number of logical blocks in the system.
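
To make the memory cost concrete, the jira could propose something like this (the
fsimageLength field is hypothetical, and BlockInfo here stands in for whatever per-block
structure the blocksMap keeps):

       class BlockInfo {
         long numBytes;      // current length known to the namenode
         int fsimageLength;  // hypothetical new field: length recorded in
                             // the fsimage at startup. A replica shorter
                             // than this is true corruption; one between
                             // this and numBytes is an interrupted append.
       }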

4) In my opinion the correct behavior for shorter blocks (but ones longer than the
fsimage-recorded length) is to invalidate them, as in the original code - however, our
invalidation code does not handle this case, because if the "corrupt" block is the last one
it keeps it as valid. Thus your patch is a good emergency fix to this very critical problem.
 I suggest that we file a jira to handle invalidating such invalid blocks correctly.
Note that here I am distinguishing between *corrupt* blocks (caused by hardware errors or by
bugs in our software) and *invalid* blocks (those length mismatches that can occur due to
client or other failures). Others may not share the distinction I make - let's debate that
in the jira; we need to get this patch out ASAP.
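
To make the invalidation gap concrete, the guard I am referring to is roughly the
following (my paraphrase, not the actual 0.18 code):

       // invalidateBlock refuses to delete the last remaining replica,
       // so a short-but-sole replica survives as "valid"
       int liveReplicas = countNodes(blk).liveReplicas();
       if (liveReplicas <= 1) {
         // keep it: removing the only copy would lose the block entirely
         return;
       }
       // otherwise schedule the replica on this datanode for deletion
       addToInvalidates(blk, dn);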




> Data lost at cluster startup time
> ---------------------------------
>
>                 Key: HADOOP-4810
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4810
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.2
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>            Priority: Blocker
>             Fix For: 0.18.3
>
>         Attachments: corruptBlocksStartup.patch
>
>
> hadoop dfs -cat file1 returns
> dfs.DFSClient: Could not obtain block blk_XX_0 from any node: java.io.IOException: No live nodes contain current block
> Tracing the history of the block from NN log, we found
> WARN org.apache.hadoop.fs.FSNamesystem: Inconsistent size for block blk_-6160940519231606858_0 reported from A1.A2.A3.A4:50010 current size is 9303872 reported size is 262144
> WARN org.apache.hadoop.fs.FSNamesystem: Deleting block blk_-6160940519231606858_0 from A1.A2.A3.A4:50010
> INFO org.apache.hadoop.dfs.StateChange: DIR* NameSystem.invalidateBlock: blk_-6160940519231606858_0 on A1.A2.A3.A4:50010
> WARN org.apache.hadoop.fs.FSNamesystem: Error in deleting bad block blk_-6160940519231606858_0 org.apache.hadoop.dfs.SafeModeException: Cannot invalidate block blk_-6160940519231606858_0. Name node is in safe mode.
> WARN org.apache.hadoop.fs.FSNamesystem: Inconsistent size for block blk_-6160940519231606858_0 reported from B1.B2.B3.B4:50010 current size is 9303872 reported size is 306688
> WARN org.apache.hadoop.fs.FSNamesystem: Deleting block blk_-6160940519231606858_0 from B1.B2.B3.B4:50010
> INFO org.apache.hadoop.dfs.StateChange: DIR* NameSystem.invalidateBlock: blk_-6160940519231606858_0 on B1.B2.B3.B4:50010
> WARN org.apache.hadoop.fs.FSNamesystem: Error in deleting bad block blk_-6160940519231606858_0 org.apache.hadoop.dfs.SafeModeException: Cannot invalidate block blk_-6160940519231606858_0. Name node is in safe mode.
> INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.chooseExcessReplicates: (C1.C2.C3.C4:50010, blk_-6160940519231606858_0) is added to recentInvalidateSets
> INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.chooseExcessReplicates: (D1.D2.D3.D4:50010, blk_-6160940519231606858_0) is added to recentInvalidateSets
> INFO org.apache.hadoop.dfs.StateChange: BLOCK* ask C1.C2.C3.C4:50010 to delete blk_-6160940519231606858_0
> INFO org.apache.hadoop.dfs.StateChange: BLOCK* ask D1.D2.D3.D4:50010 to delete blk_-6160940519231606858_0

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

