accumulo-notifications mailing list archives

From "Eric Newton (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-2339) WAL recovery fails
Date Wed, 23 Apr 2014 01:57:15 GMT


Eric Newton commented on ACCUMULO-2339:

As far as I know, this is still an issue.  I was unsuccessful in writing a test that reproduces
it.  I should probably go spelunking in the HBase tickets to see if they've run into something
similar.

We can close this as Cannot Reproduce.  I wanted to document that I saw it once, along with the
workaround: copying the block back into HDFS.
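For the record, that workaround amounts to taking the replica file straight off the DataNode's local disk and re-uploading it in place of the stuck WAL. A rough sketch, assuming the WAL fits in a single block (as in the DN log below) and using hypothetical block and WAL paths -- the real ones are cluster-specific:

```shell
# Hypothetical paths: substitute the actual block file from the DataNode's
# data directory and the WAL path named in the master's recovery logs.
BLOCK_FILE=/srv/hdfs4/hadoop/dn/current/<block-pool>/current/rbw/blk_1076582290
WAL_PATH=/accumulo/wal/<tserver>/<wal-uuid>

# Move the unreadable WAL entry aside, then upload the raw replica bytes
# in its place so log recovery can read it like an ordinary closed file.
hadoop fs -mv "$WAL_PATH" "$WAL_PATH.bad"
hadoop fs -put "$BLOCK_FILE" "$WAL_PATH"
```

For a single-block file the replica on local disk is a byte-for-byte copy of the file contents that reached the DataNode, so only the unacked tail (here, the 72-byte gap between getNumBytes() and bytesOnDisk) is lost.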

> WAL recovery fails
> ------------------
>                 Key: ACCUMULO-2339
>                 URL:
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.5.0
>         Environment: testing 1.5.1rc1 on a 10 node cluster, hadoop 2.2.0, zk 3.4.5
>            Reporter: Eric Newton
>            Priority: Critical
> I was running accumulo 1.5.1rc1 on a 10 node cluster. After two days, I saw that several tservers had died with OOME.  Several hundred tablets were offline.
> The master was attempting to recover the write lease on the file, and this was failing.
> Attempts to examine the log file failed: 
> {noformat}
> $ hadoop fs -cat /accumulo/wal/
> Cannot obtain block length for LocatedBlock{BP-901421341-; getBlockSize()=0; corrupt=false; offset=0; locs=[]}
> {noformat}
> Looking at the DN logs, I see this:
> {noformat}
> 2014-02-06 12:48:35,798 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: NameNode at host2/ calls recoverBlock(BP-901421341-, targets=[], newGenerationStamp=2880680)
> 2014-02-06 12:48:35,798 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: initReplicaRecovery: blk_1076582290_2869721, recoveryId=2880680, replica=ReplicaBeingWritten, blk_1076582290_2869721, RBW
>   getNumBytes()     = 634417185
>   getBytesOnDisk()  = 634417113
>   getVisibleLength()= 634417113
>   getVolume()       = /srv/hdfs4/hadoop/dn/current
>   getBlockFile()    = /srv/hdfs4/hadoop/dn/current/BP-901421341-
>   bytesAcked=634417113
>   bytesOnDisk=634417113
> {noformat}
> I'm guessing that the /srv/hdfs4 partition filled up, and the disagreement between the block's size on disk and the size the DataNode thinks it should be is causing the failures.
> Restarting HDFS made no difference.
> I had to manually copy the block up into HDFS as the WAL to make any progress.
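One note on the lease-recovery failure described above: "Cannot obtain block length" means the file is still open under an unrecovered lease. Newer Hadoop releases (2.7+, so not the 2.2.0 cluster in this report) ship a debug command that forces lease recovery, which is worth trying before resorting to copying the raw block. A sketch, with a hypothetical WAL path:

```shell
# Force lease recovery on the stuck WAL (Hadoop 2.7+ only; the path is
# hypothetical -- use the WAL named in the master's recovery log).
hdfs debug recoverLease -path /accumulo/wal/<tserver>/<wal-uuid> -retries 5

# If recovery succeeds, the file is finalized and its length is readable:
hadoop fs -ls /accumulo/wal/<tserver>/<wal-uuid>
```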

This message was sent by Atlassian JIRA
