accumulo-notifications mailing list archives

From "Josh Elser (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-1364) Silent failure after power outage
Date Wed, 08 May 2013 22:17:16 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13652412#comment-13652412
] 

Josh Elser commented on ACCUMULO-1364:
--------------------------------------

bq. My open question is whether we should hold the release of 1.5.0 until we can diagnose
(or fail to replicate).

My decision would be based on how much time is wanted to investigate. Obviously we're not
there yet, but if this is going to be a few weeks or a month, I wouldn't want to wait.

bq. our system clearly doesn't meet the level of reliability that we claim it meets

I didn't think we advertise any redundancy above what Hadoop provides (speaking of 1.5), so
I don't really see that as false advertising (to put words in your mouth). By that measure,
1.4 was less redundant because we had only two copies of a WAL instead of the Hadoop default
of 3.

My view on this is that if you don't have the ability to say "oops, gotta reprocess" when
faced with catastrophic power failure, you should probably have a UPS to give you enough
time to issue an orderly shutdown. Yes, it would be nice to figure out exactly what happened,
but I know I don't have the knowledge/expertise to fully track down the issue (nor the
ability to look at your nodes to figure out what happened), and I'm not going to volunteer
Eric (or anyone else, for that matter) to try and guess. Is someone other than just John looking
into this (getting back to your 'should we delay 1.5' question)?
                
> Silent failure after power outage
> ---------------------------------
>
>                 Key: ACCUMULO-1364
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1364
>             Project: Accumulo
>          Issue Type: Sub-task
>          Components: master, tserver
>         Environment: hadoop-1.0.4, accumulo-1.5-SNAPSHOT svn version 1470047
>            Reporter: John Vines
>            Assignee: Eric Newton
>            Priority: Blocker
>             Fix For: 1.5.0
>
>
> We were doing some testing on an Accumulo snapshot using continuous ingest when the power
went out. When it came back we noticed some corrupt blocks in HDFS, mostly around the WAL.
I wasn't certain if that was a happenstance of how the sync blocks can turn out, so I went
ahead and started Accumulo to see if it could handle it. What I got wasn't what I expected.
> There are 0 errors reported on the monitor. It just sits with 5 tservers available and
no tablets online. The master appears to have attempted assignment and is now waiting for the walog
to close, which never happens:
> {quote} 2013-04-30 10:38:23,648 [master.EventCoordinator] INFO : There are now 5 tablet servers
> 2013-04-30 10:38:23,719 [state.ZooTabletStateStore] DEBUG: root tablet logSet [172.16.102.202+9997/fa545e93-5eba-46b4-9266-dbd60cb56943]
> 2013-04-30 10:38:23,720 [state.ZooTabletStateStore] DEBUG: root tablet logSet [172.16.102.202+9997/ed30bd24-b348-4344-8614-a2d79f933462]
> 2013-04-30 10:38:23,725 [state.ZooTabletStateStore] DEBUG: Returning root tablet state: !0;!0<<@(null,172.16.102.202:9997[33e57eff04c0001],172.16.102.202:9997[33e57eff04c0001])
> 2013-04-30 10:38:23,740 [master.Master] INFO : Loaded class : org.apache.accumulo.server.master.recovery.HadoopLogCloser
> 2013-04-30 10:38:23,741 [recovery.RecoveryManager] INFO : Starting recovery of ed30bd24-b348-4344-8614-a2d79f933462 (in : 10s) created for 172.16.102.202+9997, tablet !0;!0<< holds a reference
> 2013-04-30 10:38:23,751 [master.Master] DEBUG: [Root Tablet]: scan time 0.04 seconds
> 2013-04-30 10:38:23,751 [master.Master] DEBUG: [Root Tablet] sleeping for 60.00 seconds
> 2013-04-30 10:38:23,823 [metrics.MetricsConfiguration] DEBUG: Loading config file: /cloud/accumulo/apache-accumulo-1.5.0-SNAPSHOT_1470047/conf/accumulo-metrics.xml
> 2013-04-30 10:38:23,838 [master.Master] DEBUG: Finished gathering information from 5 servers in 0.21 seconds
> 2013-04-30 10:38:23,841 [master.Master] DEBUG: not balancing because there are unhosted tablets
> 2013-04-30 10:38:23,852 [master.Master] DEBUG: Finished gathering information from 5 servers in 0.01 seconds
> 2013-04-30 10:38:23,852 [master.Master] DEBUG: not balancing because there are unhosted tablets
> 2013-04-30 10:38:23,861 [metrics.MetricsConfiguration] DEBUG: Metrics collection enabled=false
> 2013-04-30 10:38:23,874 [impl.ThriftScanner] DEBUG: Error getting transport to 172.16.102.202:9997 : NotServingTabletException(extent:TKeyExtent(table:21 30, endRow:21 30 3C, prevEndRow:null))
>  {quote}
> That Exception repeats endlessly with periodic
> bq. 2013-04-30 10:38:34,756 [recovery.HadoopLogCloser] INFO : Waiting for file to be closed /accumulo/wal/172.16.102.202+9997/ed30bd24-b348-4344-8614-a2d79f933462
> On the tserver in question, it seems to have no idea that it's supposed to be recovering
the root tablet though
> {quote}
> 2013-04-30 10:38:22,432 [tabletserver.TabletServer] DEBUG: org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler created
> 2013-04-30 10:38:22,544 [metrics.MetricsConfiguration] DEBUG: Loading config file: /cloud/accumulo/apache-accumulo-1.5.0-SNAPSHOT_1470047/conf/accumulo-metrics.xml
> 2013-04-30 10:38:22,549 [metrics.MetricsConfiguration] DEBUG: Metrics collection enabled=false
> 2013-04-30 10:38:22,551 [tabletserver.TabletServer] INFO : port = 9997
> 2013-04-30 10:38:22,621 [tabletserver.TabletServer] DEBUG: Obtained tablet server lock /accumulo/242078a7-dd19-4d08-8952-f5109f6f7962/tservers/172.16.102.202:9997/zlock-0000000000
> 2013-04-30 10:38:23,266 [tabletserver.TabletServer] DEBUG: gc ParNew=0.00(+0.00) secs ConcurrentMarkSweep=0.00(+0.00) secs freemem=8,486,794,504(+45,036,880) totalmem=8,536,260,608
> 2013-04-30 10:38:23,947 [tabletserver.TabletServer] DEBUG: MultiScanSess 172.16.102.200:50034 0 entries in 0.07 secs (lookup_time:0.00 secs tablets:1 ranges:1)
> 2013-04-30 10:38:23,986 [tabletserver.TabletServer] DEBUG: MultiScanSess 172.16.102.200:50034 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
> {quote}
> With that debug message repeating endlessly. Out and err files on the master and that
tserver are empty.
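
Because the monitor reports 0 errors in this state, the only visible symptom is the repeating HadoopLogCloser message in the master log. The following is a minimal sketch, not part of Accumulo, of how one might flag a walog that the master has been waiting on for too long. It assumes the master log uses the timestamp and message layout quoted above; the function name `stuck_wals` and the 5-minute threshold are hypothetical choices for illustration.

```python
import re
from datetime import datetime, timedelta

# Matches master log lines of the form quoted in the issue, e.g.
# "2013-04-30 10:38:34,756 [recovery.HadoopLogCloser] INFO : Waiting for file to be closed /accumulo/wal/..."
WAIT_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[recovery\.HadoopLogCloser\] INFO : Waiting for file to be closed (\S+)"
)


def stuck_wals(lines, threshold=timedelta(minutes=5)):
    """Return {wal_path: waiting_time} for walogs the master has been
    waiting on for at least `threshold` (an assumed, arbitrary cutoff)."""
    first_seen = {}
    last_seen = {}
    for line in lines:
        m = WAIT_RE.search(line)
        if m is None:
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
        path = m.group(2)
        first_seen.setdefault(path, ts)  # keep the earliest sighting
        last_seen[path] = ts             # overwrite with the latest sighting
    return {
        path: last_seen[path] - first_seen[path]
        for path in first_seen
        if last_seen[path] - first_seen[path] >= threshold
    }
```

Feeding it the lines of the master log (for example, `stuck_wals(open(path_to_master_log))`, where the path is whatever your installation uses) returns the walog paths that have been reported as "waiting to be closed" over a span of at least the threshold. An empty result does not prove recovery succeeded; it only means this particular message did not repeat for that long.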

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
