hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "patrick white (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3771) Namenode can't restart due to corrupt edit logs, timing issue with shutdown and edit log rolling
Date Tue, 07 Aug 2012 19:36:12 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13430561#comment-13430561
] 

patrick white commented on HDFS-3771:
-------------------------------------

Hi Todd, right, the TXIDs are correct (unlike my line item numbering in the Description) here's
the verbatim log snippet with classpath removed for clarity, hostnames and ipaddr generalized;

[org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor@3ecec78d]2012-08-06
13:02:52,216 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from
[Secondary NN]
[IPC Server handler 70 on 8020]2012-08-06 13:02:52,217 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog:
Rolling edit logs.
[IPC Server handler 70 on 8020]2012-08-06 13:02:52,217 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog:
Ending log segment 23963
[IPC Server handler 70 on 8020]2012-08-06 13:02:52,218 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog:
Number of transactions: 4 Total time for transactions(ms): 1Number of transactions batched
in Syncs: 0 Number of syncs: 5       SyncTimes(ms): 23
[IPC Server handler 70 on 8020]2012-08-06 13:02:52,220 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog:
Starting log segment at 23967
[IPC Server handler 70 on 8020]2012-08-06 13:02:52,234 INFO org.apache.hadoop.hdfs.server.namenode.NameNode:
SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at [HOSTNAME/IPADDR]
************************************************************/
[Thread-1]2012-08-06 13:03:39,287 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = [HOSTNAME/IPADDR]
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.23.3.1208042202

STARTUP_MSG:   classpath = << CLASSPATH REMOVED >>

STARTUP_MSG:   build = http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.23/hadoop-common-project/hadoop-common
-r 1368004; compiled by '[QEUSER]' on Sat Aug  4 22:15:58 PDT 2012
 ************************************************************/
  [main]2012-08-06 13:03:39,506 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded
properties from hadoop-metrics2.properties
  [main]2012-08-06 13:03:39,725 INFO org.apache.hadoop.metrics2.impl.MetricsSinkAdapter: Sink
simon_jvm started
  [main]2012-08-06 13:03:39,753 INFO org.apache.hadoop.metrics2.impl.MetricsSinkAdapter: Sink
simon_rpc started
  [main]2012-08-06 13:03:39,800 INFO org.apache.hadoop.metrics2.impl.MetricsSinkAdapter: Sink
simon_dfs started
  [main]2012-08-06 13:03:39,861 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled
snapshot period at 10 second(s).
  [main]2012-08-06 13:03:39,861 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode
metrics system started
  [main]2012-08-06 13:03:40,342 INFO org.apache.hadoop.security.UserGroupInformation: Login
successful for user hdfs/HOSTNAME@DOMAIN using keytab file [KEYTAB]
  [main]2012-08-06 13:03:40,464 INFO org.apache.hadoop.util.HostsFileReader: Refreshing hosts
(include/exclude) list
  [main]2012-08-06 13:03:40,469 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager:
dfs.block.invalidate.limit=1000
  [main]2012-08-06 13:03:40,472 INFO org.apache.hadoop.hdfs.util.GSet: VM type       = 64-bit
  [main]2012-08-06 13:03:40,472 INFO org.apache.hadoop.hdfs.util.GSet: 2% max memory = 273.85625
MB
  [main]2012-08-06 13:03:40,472 INFO org.apache.hadoop.hdfs.util.GSet: capacity      = 2^25
= 33554432 entries
  [main]2012-08-06 13:03:40,472 INFO org.apache.hadoop.hdfs.util.GSet: recommended=33554432,
actual=33554432
  [main]2012-08-06 13:03:40,674 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager:
dfs.block.access.token.enable=true
  [main]2012-08-06 13:03:40,674 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager:
dfs.block.access.key.update.interval=600 min(s), dfs.block.access.token.lifetime=600 min(s)
  [main]2012-08-06 13:03:40,683 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager:
defaultReplication = 3
  [main]2012-08-06 13:03:40,683 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager:
maxReplication     = 50
  [main]2012-08-06 13:03:40,684 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager:
minReplication     = 1
  [main]2012-08-06 13:03:40,684 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager:
maxReplicationStreams      = 2
  [main]2012-08-06 13:03:40,684 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager:
shouldCheckForEnoughRacks  = true
  [main]2012-08-06 13:03:40,684 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager:
replicationRecheckInterval = 3000
  [main]2012-08-06 13:03:40,684 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
fsOwner=hdfs/[HOSTNAME@DOMAIN] (auth:KERBEROS)
  [main]2012-08-06 13:03:40,684 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
supergroup=hdfs
  [main]2012-08-06 13:03:40,684 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
isPermissionEnabled=true
  [main]2012-08-06 13:03:40,692 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
Registered FSNamesystemState MBean
  [main]2012-08-06 13:03:40,732 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Caching
file names occuring more than 10 times
  [main]2012-08-06 13:03:40,810 WARN org.apache.hadoop.hdfs.server.namenode.FSImageTransactionalStorageInspector:
Logs beginning at txid  were are all in-progress (probably truncated due to a previous NameNode
crash)
  [main]2012-08-06 13:03:40,817 WARN org.apache.hadoop.hdfs.server.namenode.FSImageTransactionalStorageInspector:
Marking log at /grid/[PATH]/edits_inprogress_0000000000000023967 as corrupt since it has no
tra      nsactions in it.
  [main]2012-08-06 13:03:40,818 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Exception
in namenode join
  [main]java.lang.IllegalStateException: No non-corrupt logs for txid 23967
         at org.apache.hadoop.hdfs.server.namenode.FSImageTransactionalStorageInspector$LogGroup.getBestNonCorruptLog(FSImageTransactionalStorageInspector.java:302)
         at org.apache.hadoop.hdfs.server.namenode.FSImageTransactionalStorageInspector.createLogLoadPlan(FSImageTransactionalStorageInspector.java:230)
         at org.apache.hadoop.hdfs.server.namenode.FSImageTransactionalStorageInspector.createLoadPlan(FSImageTransactionalStorageInspector.java:178)
         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:592)
         at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:276)
         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:175)
         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:333)
         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:305)
         at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:299)
         at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:333)
         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:459)
         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:451)
         at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:752)
         at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:800)
2012-08-06 13:03:40,820 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at [HOSTNAME/IPADDR]
************************************************************/

                
> Namenode can't restart due to corrupt edit logs, timing issue with shutdown and edit
log rolling
> ------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-3771
>                 URL: https://issues.apache.org/jira/browse/HDFS-3771
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.23.3, 2.0.0-alpha
>         Environment: QE, 20 node Federated cluster with 3 NNs and 15 DNs, using Kerberos
based security
>            Reporter: patrick white
>            Priority: Critical
>
> Our 0.23.3 nightly HDFS regression suite encountered a particularly nasty issue recently,
which resulted in the cluster's default Namenode being unable to restart, this was on a 20
node Federated cluster with security. The cause appears to be that the NN was just starting
to roll its edit log when a shutdown occurred, the shutdown was intentional to restart the
cluster as part of an automated test.
> The tests that were running do not appear to be the issue in themselves, the cluster
was just wrapping up an adminReport subset and this failure case has not reproduce so far,
nor was it failing previously. It looks like a chance occurrence of sending the shutdown just
as the edit log roll was begun.
> From the NN log, the following sequence is noted:
> 1. an InvalidateBlocks operation had completed
> 2. FSNamesystem: Roll Edit Log from [Secondary Namenode IPaddr]
> 3. FSEditLog: Ending log segment 23963
> 4. FSEditLog: Starting log segment at 23967
> 4. NameNode: SHUTDOWN_MSG
> => the NN shuts down and then is restarted...
> 5. FSImageTransactionalStorageInspector: Logs beginning at txid 23967 were are all in-progress
> 6. FSImageTransactionalStorageInspector: Marking log at /grid/[PATH]/edits_inprogress_0000000000000023967
as corrupt since it has no transactions in it.
> 7. NameNode: Exception in namenode join [main]java.lang.IllegalStateException: No non-corrupt
logs for txid 23967
> => NN start attempts continue to cycle trying to restart but can't, failing on the
same exception due to lack of non-corrupt edit logs
> If observations are correct and issue is from shutdown happening as edit logs are rolling,
does the NN have an equivalent to the conventional fs 'sync' blocking action that should be
called, or perhaps has a timing hole?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message