accumulo-notifications mailing list archives

From "Josh Elser (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (ACCUMULO-2963) ReplicationDriver daemon dies from RTE thrown out of BatchScanner
Date Tue, 01 Jul 2014 05:32:24 GMT

     [ https://issues.apache.org/jira/browse/ACCUMULO-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Elser resolved ACCUMULO-2963.
----------------------------------

    Resolution: Fixed

> ReplicationDriver daemon dies from RTE thrown out of BatchScanner
> -----------------------------------------------------------------
>
>                 Key: ACCUMULO-2963
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2963
>             Project: Accumulo
>          Issue Type: Bug
>          Components: replication
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>             Fix For: 1.7.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Saw a failure on the build server where replication didn't happen in an integration test.
> A tablet server was restarted as part of this test.
> As the tablet server was starting back up, the Master was trying to scan the ReplicationTable.
> Before the tserver came up "completely" (not sure of the details), the Master started getting
> repeated RuntimeExceptions:
> {noformat}
> Exception in thread "Replication Driver" java.lang.RuntimeException: org.apache.accumulo.core.client.AccumuloSecurityException: Error DEFAULT_SECURITY_ERROR for user !SYSTEM on table replication(ID:3) - Unknown security exception
>         at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.hasNext(TabletServerBatchReaderIterator.java:182)
>         at org.apache.accumulo.master.replication.RemoveCompleteReplicationRecords.removeCompleteRecords(RemoveCompleteReplicationRecords.java:124)
>         at org.apache.accumulo.master.replication.RemoveCompleteReplicationRecords.run(RemoveCompleteReplicationRecords.java:88)
>         at org.apache.accumulo.master.replication.ReplicationDriver.run(ReplicationDriver.java:94)
> Caused by: org.apache.accumulo.core.client.AccumuloSecurityException: Error DEFAULT_SECURITY_ERROR for user !SYSTEM on table replication(ID:3) - Unknown security exception
>         at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:690)
>         at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:592)
>         at org.apache.accumulo.core.metadata.MetadataLocationObtainer.lookupTablets(MetadataLocationObtainer.java:181)
>         at org.apache.accumulo.core.client.impl.TabletLocatorImpl.processInvalidated(TabletLocatorImpl.java:667)
>         at org.apache.accumulo.core.client.impl.TabletLocatorImpl.binRanges(TabletLocatorImpl.java:337)
>         at org.apache.accumulo.core.client.impl.TabletLocatorImpl.processInvalidated(TabletLocatorImpl.java:660)
>         at org.apache.accumulo.core.client.impl.TabletLocatorImpl.binRanges(TabletLocatorImpl.java:337)
>         at org.apache.accumulo.core.client.impl.TimeoutTabletLocator.binRanges(TimeoutTabletLocator.java:104)
>         at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.binRanges(TabletServerBatchReaderIterator.java:230)
>         at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.processFailures(TabletServerBatchReaderIterator.java:302)
>         at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.access$1400(TabletServerBatchReaderIterator.java:76)
>         at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator$QueryTask.run(TabletServerBatchReaderIterator.java:386)
>         at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
>         at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: ThriftSecurityException(user:!SYSTEM, code:null)
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$startMultiScan_result$startMultiScan_resultStandardScheme.read(TabletClientService.java:10045)
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$startMultiScan_result$startMultiScan_resultStandardScheme.read(TabletClientService.java:10022)
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$startMultiScan_result.read(TabletClientService.java:9961)
>         at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_startMultiScan(TabletClientService.java:313)
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.startMultiScan(TabletClientService.java:293)
>         at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:632)
>         ... 17 more
> {noformat}
> The TabletServer was still in the process of starting, but must have already obtained its
> lock (otherwise we couldn't have talked to it). It appears that the exceptions started printing
> repeatedly in the Master log before the tserver hit its main loop (lines 2414-2471 at f4024930).
> I think there may be a separate issue with the client receiving those exceptions before
> a tserver is "fully" up, but the Master thread needs to be resilient against these exceptions
> bubbling up.
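
To illustrate what "resilient" could mean for a daemon loop like this, here is a minimal sketch. It is not the actual fix committed for ACCUMULO-2963 and it uses no Accumulo classes: `ResilientDriver`, `WorkCycle`, and `drive` are hypothetical names, and the bounded cycle count exists only so the example terminates. The idea is simply that each pass of work is wrapped in a catch block, so a RuntimeException thrown from a scanner iterator is logged and the next pass still runs, rather than escaping `run()` and killing the thread.

```java
final class ResilientDriver {

    /** Hypothetical stand-in for one pass of replication bookkeeping. */
    interface WorkCycle {
        void runOnce() throws Exception;
    }

    /**
     * Runs the cycle up to maxCycles times, sleeping between passes.
     * An exception from one pass is logged and counted; it does not
     * stop later passes. Returns the number of passes that failed.
     */
    static int drive(WorkCycle cycle, int maxCycles, long sleepMillis) {
        int failures = 0;
        for (int i = 0; i < maxCycles; i++) {
            try {
                cycle.runOnce();
            } catch (Exception e) {
                // Also catches RuntimeException, e.g. one thrown from an
                // iterator's hasNext() while a server is still starting.
                failures++;
                System.err.println("Work cycle failed, will retry: " + e);
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                // Preserve the interrupt status and stop the loop.
                Thread.currentThread().interrupt();
                break;
            }
        }
        return failures;
    }
}
```

A real daemon would loop until shutdown instead of counting cycles, and might distinguish transient failures from ones that should stop the thread. Those choices are outside what this issue describes.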



--
This message was sent by Atlassian JIRA
(v6.2#6252)
