accumulo-notifications mailing list archives

From "Eric Newton (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ACCUMULO-2339) WAL recovery fails
Date Fri, 07 Feb 2014 21:14:21 GMT
Eric Newton created ACCUMULO-2339:
-------------------------------------

             Summary: WAL recovery fails
                 Key: ACCUMULO-2339
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2339
             Project: Accumulo
          Issue Type: Bug
          Components: tserver
         Environment: testing 1.5.1rc1 on a 10 node cluster, hadoop 2.2.0, zk 3.4.5
            Reporter: Eric Newton
            Priority: Critical


I was running Accumulo 1.5.1rc1 on a 10-node cluster. After two days, I saw that several tservers
had died with OOME. Several hundred tablets were offline.

The master was attempting to recover the write lease on a WAL file, and the recovery kept failing.

Attempts to examine the log file failed: 

{noformat}
$ hadoop fs -cat /accumulo/wal/192.168.1.5+9997/bc94602a-9a57-45f6-afdf-ffa2a5b70b14
Cannot obtain block length for LocatedBlock{BP-901421341-192.168.1.3-1389719663617:blk_1076582460_2869891; getBlockSize()=0; corrupt=false; offset=0; locs=[192.168.1.5:50010]}
{noformat}

Looking at the DN logs, I see this:
{noformat}
2014-02-06 12:48:35,798 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: NameNode at host2/192.168.1.3:9000 calls recoverBlock(BP-901421341-192.168.1.3-1389719663617:blk_1076582290_2869721, targets=[192.168.1.5:50010], newGenerationStamp=2880680)
2014-02-06 12:48:35,798 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: initReplicaRecovery: blk_1076582290_2869721, recoveryId=2880680, replica=ReplicaBeingWritten, blk_1076582290_2869721, RBW
  getNumBytes()     = 634417185
  getBytesOnDisk()  = 634417113
  getVisibleLength()= 634417113
  getVolume()       = /srv/hdfs4/hadoop/dn/current
  getBlockFile()    = /srv/hdfs4/hadoop/dn/current/BP-901421341-192.168.1.3-1389719663617/current/rbw/blk_1076582290
  bytesAcked=634417113
  bytesOnDisk=634417113
{noformat}

I'm guessing that the /srv/hdfs4 partition filled up, and that the disagreement between the
block length the DN expects (getNumBytes) and what is actually on disk (getBytesOnDisk) is
causing the failures.
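
As a quick sanity check on the numbers in the DN log above, here is a minimal Python sketch (not part of the original report) that pulls the replica fields out of the initReplicaRecovery entry and computes how far the expected block length runs past the bytes on disk. The helper name parse_replica_fields and the REPLICA_LOG constant are hypothetical, and the field values are copied from the log. With these values the gap is 72 bytes.

```python
import re

# Replica fields copied from the initReplicaRecovery log entry in this report.
REPLICA_LOG = """\
  getNumBytes()     = 634417185
  getBytesOnDisk()  = 634417113
  getVisibleLength()= 634417113
  bytesAcked=634417113
  bytesOnDisk=634417113
"""


def parse_replica_fields(text):
    """Return {field_name: int} for each 'name() = N' or 'name=N' entry.

    Hypothetical helper for illustration; it is not a Hadoop or Accumulo API.
    """
    fields = {}
    for match in re.finditer(r"(\w+)(?:\(\))?\s*=\s*(\d+)", text):
        fields[match.group(1)] = int(match.group(2))
    return fields


fields = parse_replica_fields(REPLICA_LOG)

# Bytes the replica claims to hold minus bytes actually written to disk.
gap = fields["getNumBytes"] - fields["getBytesOnDisk"]
print(f"expected length exceeds on-disk length by {gap} bytes")
```

A nonzero gap only shows that the two lengths disagree; it does not by itself confirm that a full partition was the cause.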

Restarting HDFS made no difference.

To make any progress, I manually copied the block file back up into HDFS as the WAL.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
