Date: Tue, 23 Sep 2014 14:56:34 +0000 (UTC)
From: "Eric Newton (JIRA)"
Reply-To: jira@apache.org
To: notifications@accumulo.apache.org
Subject: [jira] [Resolved] (ACCUMULO-2339) WAL recovery fails

     [ https://issues.apache.org/jira/browse/ACCUMULO-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Newton resolved ACCUMULO-2339.
-----------------------------------
    Resolution: Cannot Reproduce

I've tried to reproduce this problem reliably, but I can't. I've only seen it twice, both times under a full HDFS.
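In case anyone else hits the same symptom (description quoted below): as far as I can tell, the "Cannot obtain block length for LocatedBlock" error means the WAL's last block was never finalized, i.e. lease/block recovery didn't complete. A rough first check is fsck's -openforwrite option against the WAL directory (the path below is just this cluster's layout, adjust as needed):

{noformat}
$ hdfs fsck /accumulo/wal -files -blocks -openforwrite
{noformat}

Files still reported as OPENFORWRITE after their tserver is gone are the likely suspects.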
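Also, for the record, the "manually copied the block up into HDFS" workaround mentioned at the end of the description amounted to roughly the following. Treat it as a sketch rather than a recipe; the angle-bracket names are placeholders:

{noformat}
# Run on the datanode that still holds the replica. The local block file is the
# getBlockFile() path from the DN log; the target is the WAL file that fails to cat.
$ hadoop fs -rm /accumulo/wal/<tserver-ip+port>/<wal-uuid>
$ hadoop fs -put /srv/hdfsN/hadoop/dn/current/<block-pool-id>/current/rbw/blk_<id> \
    /accumulo/wal/<tserver-ip+port>/<wal-uuid>
{noformat}

This only makes sense when the on-disk replica is the best copy left; it works around block recovery rather than fixing it.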
> WAL recovery fails
> ------------------
>
>                 Key: ACCUMULO-2339
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2339
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.5.0
>        Environment: testing 1.5.1rc1 on a 10 node cluster, hadoop 2.2.0, zk 3.4.5
>            Reporter: Eric Newton
>            Assignee: Eric Newton
>            Priority: Critical
>
> I was running accumulo 1.5.1rc1 on a 10 node cluster. After two days, I saw that several tservers had died with OOME. Several hundred tablets were offline.
> The master was attempting to recover the write lease on one of the WAL files, and this was failing.
> Attempts to examine the log file failed:
> {noformat}
> $ hadoop fs -cat /accumulo/wal/192.168.1.5+9997/bc94602a-9a57-45f6-afdf-ffa2a5b70b14
> Cannot obtain block length for LocatedBlock{BP-901421341-192.168.1.3-1389719663617:blk_1076582460_2869891; getBlockSize()=0; corrupt=false; offset=0; locs=[192.168.1.5:50010]}
> {noformat}
> Looking at the DN logs, I see this:
> {noformat}
> 2014-02-06 12:48:35,798 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: NameNode at host2/192.168.1.3:9000 calls recoverBlock(BP-901421341-192.168.1.3-1389719663617:blk_1076582290_2869721, targets=[192.168.1.5:50010], newGenerationStamp=2880680)
> 2014-02-06 12:48:35,798 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: initReplicaRecovery: blk_1076582290_2869721, recoveryId=2880680, replica=ReplicaBeingWritten, blk_1076582290_2869721, RBW
>   getNumBytes()     = 634417185
>   getBytesOnDisk()  = 634417113
>   getVisibleLength()= 634417113
>   getVolume()       = /srv/hdfs4/hadoop/dn/current
>   getBlockFile()    = /srv/hdfs4/hadoop/dn/current/BP-901421341-192.168.1.3-1389719663617/current/rbw/blk_1076582290
>   bytesAcked=634417113
>   bytesOnDisk=634417113
> {noformat}
> I'm guessing that the /srv/hdfs4 partition filled up, and the disagreement between the size of the block on disk and the size the DN thinks it should be (getNumBytes() = 634417185 vs. bytesOnDisk = 634417113) is causing the recovery failures.
> Restarting HDFS made no difference.
> I manually copied the block up into HDFS as the WAL to make any progress.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)