Date: Fri, 10 May 2013 20:07:15 +0000 (UTC)
From: "Eric Newton (JIRA)"
To: notifications@accumulo.apache.org
Reply-To: jira@apache.org
Subject: [jira] [Commented] (ACCUMULO-1364) Silent failure after power outage

[ https://issues.apache.org/jira/browse/ACCUMULO-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13654786#comment-13654786 ]

Eric Newton commented on ACCUMULO-1364:
---------------------------------------

Aaaand... that failed.
HDFS recovered healthy, but the WAL would not recover:

{noformat}
2013-05-10 16:04:44,645 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: initReplicaRecovery: update recovery id for blk_-8960414614127036293_3369 from 3502 to 3503
2013-05-10 16:04:44,645 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: updateReplica: BP-1684116895-127.0.1.1-1368212963111:blk_-8960414614127036293_3369, recoveryId=3503, length=644894208, replica=ReplicaUnderRecovery, blk_-8960414614127036293_3369, RUR
  getNumBytes() = 644894208
  getBytesOnDisk() = 644894208
  getVisibleLength()= -1
  getVolume() = /home/ecn/data/dfs/current
  getBlockFile() = /home/ecn/data/dfs/current/BP-1684116895-127.0.1.1-1368212963111/current/rbw/blk_-8960414614127036293
  recoveryId=3503
  original=ReplicaWaitingToBeRecovered, blk_-8960414614127036293_3369, RWR
  getNumBytes() = 644894208
  getBytesOnDisk() = 644894208
  getVisibleLength()= -1
  getVolume() = /home/ecn/data/dfs/current
  getBlockFile() = /home/ecn/data/dfs/current/BP-1684116895-127.0.1.1-1368212963111/current/rbw/blk_-8960414614127036293
  unlinked=false
{noformat}

> Silent failure after power outage
> ---------------------------------
>
> Key: ACCUMULO-1364
> URL: https://issues.apache.org/jira/browse/ACCUMULO-1364
> Project: Accumulo
> Issue Type: Sub-task
> Components: master, tserver
> Environment: hadoop-1.0.4, accumulo-1.5-SNAPSHOT svn version 1470047
> Reporter: John Vines
> Assignee: Eric Newton
> Priority: Blocker
> Fix For: 1.5.0
>
>
> We were doing some testing on an Accumulo snapshot using continuous ingest when the power went out. When it came back we noticed some corrupt blocks in HDFS, mostly around the WAL. I wasn't certain if that was a happenstance of how the sync blocks can turn out, so I went ahead and started Accumulo to see if it could handle it. What I got wasn't what I expected.
> There are 0 errors reported on the monitor. It just sits with 5 tservers available and no tablets online.
> The master appears to have attempted the assignment and is now waiting for the walog to close, which never happens:
> {quote} 2013-04-30 10:38:23,648 [master.EventCoordinator] INFO : There are now 5 tablet servers
> 2013-04-30 10:38:23,719 [state.ZooTabletStateStore] DEBUG: root tablet logSet [172.16.102.202+9997/fa545e93-5eba-46b4-9266-dbd60cb56943]
> 2013-04-30 10:38:23,720 [state.ZooTabletStateStore] DEBUG: root tablet logSet [172.16.102.202+9997/ed30bd24-b348-4344-8614-a2d79f933462]
> 2013-04-30 10:38:23,725 [state.ZooTabletStateStore] DEBUG: Returning root tablet state: !0;!0<<@(null,172.16.102.202:9997[33e57eff04c0001],172.16.102.202:9997[33e57eff04c0001])
> 2013-04-30 10:38:23,740 [master.Master] INFO : Loaded class : org.apache.accumulo.server.master.recovery.HadoopLogCloser
> 2013-04-30 10:38:23,741 [recovery.RecoveryManager] INFO : Starting recovery of ed30bd24-b348-4344-8614-a2d79f933462 (in : 10s) created for 172.16.102.202+9997, tablet !0;!0<< holds a reference
> 2013-04-30 10:38:23,751 [master.Master] DEBUG: [Root Tablet]: scan time 0.04 seconds
> 2013-04-30 10:38:23,751 [master.Master] DEBUG: [Root Tablet] sleeping for 60.00 seconds
> 2013-04-30 10:38:23,823 [metrics.MetricsConfiguration] DEBUG: Loading config file: /cloud/accumulo/apache-accumulo-1.5.0-SNAPSHOT_1470047/conf/accumulo-metrics.xml
> 2013-04-30 10:38:23,838 [master.Master] DEBUG: Finished gathering information from 5 servers in 0.21 seconds
> 2013-04-30 10:38:23,841 [master.Master] DEBUG: not balancing because there are unhosted tablets
> 2013-04-30 10:38:23,852 [master.Master] DEBUG: Finished gathering information from 5 servers in 0.01 seconds
> 2013-04-30 10:38:23,852 [master.Master] DEBUG: not balancing because there are unhosted tablets
> 2013-04-30 10:38:23,861 [metrics.MetricsConfiguration] DEBUG: Metrics collection enabled=false
> 2013-04-30 10:38:23,874 [impl.ThriftScanner] DEBUG: Error getting transport to 172.16.102.202:9997 : NotServingTabletException(extent:TKeyExtent(table:21 30, endRow:21 30 3C, prevEndRow:null))
> {quote}
> That exception repeats endlessly, with periodic
> bq. 2013-04-30 10:38:34,756 [recovery.HadoopLogCloser] INFO : Waiting for file to be closed /accumulo/wal/172.16.102.202+9997/ed30bd24-b348-4344-8614-a2d79f933462
> The tserver in question, though, seems to have no idea that it's supposed to be recovering the root tablet:
> {quote}
> 2013-04-30 10:38:22,432 [tabletserver.TabletServer] DEBUG: org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler created
> 2013-04-30 10:38:22,544 [metrics.MetricsConfiguration] DEBUG: Loading config file: /cloud/accumulo/apache-accumulo-1.5.0-SNAPSHOT_1470047/conf/accumulo-metrics.xml
> 2013-04-30 10:38:22,549 [metrics.MetricsConfiguration] DEBUG: Metrics collection enabled=false
> 2013-04-30 10:38:22,551 [tabletserver.TabletServer] INFO : port = 9997
> 2013-04-30 10:38:22,621 [tabletserver.TabletServer] DEBUG: Obtained tablet server lock /accumulo/242078a7-dd19-4d08-8952-f5109f6f7962/tservers/172.16.102.202:9997/zlock-0000000000
> 2013-04-30 10:38:23,266 [tabletserver.TabletServer] DEBUG: gc ParNew=0.00(+0.00) secs ConcurrentMarkSweep=0.00(+0.00) secs freemem=8,486,794,504(+45,036,880) totalmem=8,536,260,608
> 2013-04-30 10:38:23,947 [tabletserver.TabletServer] DEBUG: MultiScanSess 172.16.102.200:50034 0 entries in 0.07 secs (lookup_time:0.00 secs tablets:1 ranges:1)
> 2013-04-30 10:38:23,986 [tabletserver.TabletServer] DEBUG: MultiScanSess 172.16.102.200:50034 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
> {quote}
> That debug message repeats endlessly. The out and err files on the master and that tserver are empty.
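
Since the monitor reports no errors in this state, one way to notice the stall is to count how often the master log repeats the HadoopLogCloser "Waiting for file to be closed" message for the same WAL. The following is a minimal sketch, assuming the log line format shown in the quoted master log; the function name `stuck_wals` and the `threshold` parameter are hypothetical and not part of Accumulo.

```python
import re
from collections import Counter

# Assumption: matches the HadoopLogCloser message as it appears in the
# master log excerpt quoted in this thread.
WAITING = re.compile(
    r"\[recovery\.HadoopLogCloser\]\s+INFO\s*:\s*"
    r"Waiting for file to be closed\s+(\S+)"
)


def stuck_wals(lines, threshold=3):
    """Return WAL paths reported as waiting to be closed at least `threshold` times.

    `lines` is any iterable of log lines; `threshold` is a hypothetical
    cut-off chosen for illustration only.
    """
    counts = Counter()
    for line in lines:
        match = WAITING.search(line)
        if match:
            counts[match.group(1)] += 1
    return sorted(path for path, n in counts.items() if n >= threshold)
```

Feeding it the lines of a master log (for example, from an open file object) yields the WAL paths the master keeps waiting on, which is the symptom described in this issue.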