accumulo-notifications mailing list archives

From "John Vines (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ACCUMULO-1364) Silent failure after power outage
Date Tue, 30 Apr 2013 15:18:15 GMT
John Vines created ACCUMULO-1364:
------------------------------------

             Summary: Silent failure after power outage
                 Key: ACCUMULO-1364
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1364
             Project: Accumulo
          Issue Type: Bug
          Components: master, tserver
         Environment: hadoop-1.0.4, accumulo-1.5-SNAPSHOT svn version 1470047
            Reporter: John Vines
            Assignee: Eric Newton
            Priority: Blocker
             Fix For: 1.5.0


We were testing an Accumulo snapshot with continuous ingest when the power went out. When it
came back, we noticed some corrupt blocks in HDFS, mostly around the WAL. I wasn't certain
whether that was just an artifact of how the sync blocks can turn out, so I went ahead and
started Accumulo to see if it could handle it. What I got wasn't what I expected.

There are 0 errors reported on the monitor. It just sits with 5 tservers available and no
tablets online. The master appears to have attempted assignment and is now waiting for the
walog to close, which never happens:
{quote}
2013-04-30 10:38:23,648 [master.EventCoordinator] INFO : There are now 5 tablet servers
2013-04-30 10:38:23,719 [state.ZooTabletStateStore] DEBUG: root tablet logSet [172.16.102.202+9997/fa545e93-5eba-46b4-9266-dbd60cb56943]
2013-04-30 10:38:23,720 [state.ZooTabletStateStore] DEBUG: root tablet logSet [172.16.102.202+9997/ed30bd24-b348-4344-8614-a2d79f933462]
2013-04-30 10:38:23,725 [state.ZooTabletStateStore] DEBUG: Returning root tablet state: !0;!0<<@(null,172.16.102.202:9997[33e57eff04c0001],172.16.102.202:9997[33e57eff04c0001])
2013-04-30 10:38:23,740 [master.Master] INFO : Loaded class : org.apache.accumulo.server.master.recovery.HadoopLogCloser
2013-04-30 10:38:23,741 [recovery.RecoveryManager] INFO : Starting recovery of ed30bd24-b348-4344-8614-a2d79f933462 (in : 10s) created for 172.16.102.202+9997, tablet !0;!0<< holds a reference
2013-04-30 10:38:23,751 [master.Master] DEBUG: [Root Tablet]: scan time 0.04 seconds
2013-04-30 10:38:23,751 [master.Master] DEBUG: [Root Tablet] sleeping for 60.00 seconds
2013-04-30 10:38:23,823 [metrics.MetricsConfiguration] DEBUG: Loading config file: /cloud/accumulo/apache-accumulo-1.5.0-SNAPSHOT_1470047/conf/accumulo-metrics.xml
2013-04-30 10:38:23,838 [master.Master] DEBUG: Finished gathering information from 5 servers in 0.21 seconds
2013-04-30 10:38:23,841 [master.Master] DEBUG: not balancing because there are unhosted tablets
2013-04-30 10:38:23,852 [master.Master] DEBUG: Finished gathering information from 5 servers in 0.01 seconds
2013-04-30 10:38:23,852 [master.Master] DEBUG: not balancing because there are unhosted tablets
2013-04-30 10:38:23,861 [metrics.MetricsConfiguration] DEBUG: Metrics collection enabled=false
2013-04-30 10:38:23,874 [impl.ThriftScanner] DEBUG: Error getting transport to 172.16.102.202:9997 : NotServingTabletException(extent:TKeyExtent(table:21 30, endRow:21 30 3C, prevEndRow:null))
{quote}
That exception repeats endlessly, interspersed with a periodic
bq. 2013-04-30 10:38:34,756 [recovery.HadoopLogCloser] INFO : Waiting for file to be closed /accumulo/wal/172.16.102.202+9997/ed30bd24-b348-4344-8614-a2d79f933462
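
Because nothing surfaces on the monitor, one way to spot this kind of stall is to count how often the master reports waiting on the same WAL. The following is only a rough sketch, not part of Accumulo: the function name `stuck_wals` and the `threshold` parameter are hypothetical, and the pattern assumes the message format shown in the log line quoted above.

```python
import re
from collections import Counter

# Matches the HadoopLogCloser message quoted above. The WAL path may sit on the
# same line or be wrapped onto the next one, so \s+ also spans a line break.
WAITING = re.compile(
    r"\[recovery\.HadoopLogCloser\] INFO : Waiting for file to be closed\s+(\S+)"
)


def stuck_wals(log_text, threshold=3):
    """Return {wal_path: count} for WALs reported as waiting at least `threshold` times.

    `threshold` is an arbitrary, hypothetical cut-off; tune it to the log window.
    """
    counts = Counter(WAITING.findall(log_text))
    return {path: n for path, n in counts.items() if n >= threshold}
```

Feeding it the text of the master debug log would return the WAL paths whose close the master keeps waiting on, along with how many times each was reported.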


The tserver in question, however, seems to have no idea that it is supposed to be recovering
the root tablet:
{quote}
2013-04-30 10:38:22,432 [tabletserver.TabletServer] DEBUG: org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler created
2013-04-30 10:38:22,544 [metrics.MetricsConfiguration] DEBUG: Loading config file: /cloud/accumulo/apache-accumulo-1.5.0-SNAPSHOT_1470047/conf/accumulo-metrics.xml
2013-04-30 10:38:22,549 [metrics.MetricsConfiguration] DEBUG: Metrics collection enabled=false
2013-04-30 10:38:22,551 [tabletserver.TabletServer] INFO : port = 9997
2013-04-30 10:38:22,621 [tabletserver.TabletServer] DEBUG: Obtained tablet server lock /accumulo/242078a7-dd19-4d08-8952-f5109f6f7962/tservers/172.16.102.202:9997/zlock-0000000000
2013-04-30 10:38:23,266 [tabletserver.TabletServer] DEBUG: gc ParNew=0.00(+0.00) secs ConcurrentMarkSweep=0.00(+0.00) secs freemem=8,486,794,504(+45,036,880) totalmem=8,536,260,608
2013-04-30 10:38:23,947 [tabletserver.TabletServer] DEBUG: MultiScanSess 172.16.102.200:50034 0 entries in 0.07 secs (lookup_time:0.00 secs tablets:1 ranges:1)
2013-04-30 10:38:23,986 [tabletserver.TabletServer] DEBUG: MultiScanSess 172.16.102.200:50034 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
{quote}
That last debug message repeats endlessly. The out and err files on the master and that
tserver are empty.

