accumulo-notifications mailing list archives

From "Josh Elser (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-2964) Unexpected ThriftSecurityException from BatchScanner
Date Fri, 11 Jul 2014 19:19:06 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059222#comment-14059222 ]

Josh Elser commented on ACCUMULO-2964:
--------------------------------------

Just saw this again last night, but on a (Batch)Scanner instead of a BatchWriter this time.
Same premise: a tserver was killed and restarted. The client saw ~30s of connection-refused errors
against the old server, and then suddenly a burst of {{DEFAULT_SECURITY_ERROR}} Thrift exceptions.

Another interesting difference is that the exceptions I'm seeing this time are for
!SYSTEM too, not just root.

{noformat}
2014-07-11 04:13:43,713 [impl.TabletServerBatchReaderIterator] DEBUG: Server : juno:59672 msg : null
ThriftSecurityException(user:!SYSTEM, code:null)
        at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$startMultiScan_result$startMultiScan_resultStandardScheme.read(TabletClientService.java:10045)
        at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$startMultiScan_result$startMultiScan_resultStandardScheme.read(TabletClientService.java:10022)
        at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$startMultiScan_result.read(TabletClientService.java:9961)
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
        at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_startMultiScan(TabletClientService.java:313)
        at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.startMultiScan(TabletClientService.java:293)
        at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:632)
        at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:592)
        at org.apache.accumulo.core.metadata.MetadataLocationObtainer.lookupTablets(MetadataLocationObtainer.java:181)
        at org.apache.accumulo.core.client.impl.TabletLocatorImpl.processInvalidated(TabletLocatorImpl.java:667)
        at org.apache.accumulo.core.client.impl.TabletLocatorImpl.binRanges(TabletLocatorImpl.java:337)
        at org.apache.accumulo.core.client.impl.TimeoutTabletLocator.binRanges(TimeoutTabletLocator.java:104)
        at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.binRanges(TabletServerBatchReaderIterator.java:230)
        at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.lookup(TabletServerBatchReaderIterator.java:217)
        at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.<init>(TabletServerBatchReaderIterator.java:155)
        at org.apache.accumulo.core.client.impl.TabletServerBatchReader.iterator(TabletServerBatchReader.java:115)
        at org.apache.accumulo.server.master.state.MetaDataTableScanner.<init>(MetaDataTableScanner.java:66)
        at org.apache.accumulo.server.master.state.MetaDataTableScanner.<init>(MetaDataTableScanner.java:56)
        at org.apache.accumulo.server.master.state.MetaDataStateStore.iterator(MetaDataStateStore.java:67)
        at org.apache.accumulo.master.TabletGroupWatcher.run(TabletGroupWatcher.java:158)
{noformat}

{noformat}
2014-07-11 04:13:43,714 [master.Master] ERROR: Error processing table state for store Normal Tablets
java.lang.RuntimeException: java.lang.RuntimeException: Failed to create iterator
        at org.apache.accumulo.server.master.state.MetaDataTableScanner.<init>(MetaDataTableScanner.java:72)
        at org.apache.accumulo.server.master.state.MetaDataTableScanner.<init>(MetaDataTableScanner.java:56)
        at org.apache.accumulo.server.master.state.MetaDataStateStore.iterator(MetaDataStateStore.java:67)
        at org.apache.accumulo.master.TabletGroupWatcher.run(TabletGroupWatcher.java:158)
Caused by: java.lang.RuntimeException: Failed to create iterator
        at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.<init>(TabletServerBatchReaderIterator.java:159)
        at org.apache.accumulo.core.client.impl.TabletServerBatchReader.iterator(TabletServerBatchReader.java:115)
        at org.apache.accumulo.server.master.state.MetaDataTableScanner.<init>(MetaDataTableScanner.java:66)
        ... 3 more
Caused by: org.apache.accumulo.core.client.AccumuloSecurityException: Error DEFAULT_SECURITY_ERROR for user !SYSTEM - Unknown security exception
        at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:690)
        at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:592)
        at org.apache.accumulo.core.metadata.MetadataLocationObtainer.lookupTablets(MetadataLocationObtainer.java:181)
        at org.apache.accumulo.core.client.impl.TabletLocatorImpl.processInvalidated(TabletLocatorImpl.java:667)
        at org.apache.accumulo.core.client.impl.TabletLocatorImpl.binRanges(TabletLocatorImpl.java:337)
        at org.apache.accumulo.core.client.impl.TimeoutTabletLocator.binRanges(TimeoutTabletLocator.java:104)
        at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.binRanges(TabletServerBatchReaderIterator.java:230)
        at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.lookup(TabletServerBatchReaderIterator.java:217)
        at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.<init>(TabletServerBatchReaderIterator.java:155)
        ... 5 more
Caused by: ThriftSecurityException(user:!SYSTEM, code:null)
        at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$startMultiScan_result$startMultiScan_resultStandardScheme.read(TabletClientService.java:10045)
        at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$startMultiScan_result$startMultiScan_resultStandardScheme.read(TabletClientService.java:10022)
        at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$startMultiScan_result.read(TabletClientService.java:9961)
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
        at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_startMultiScan(TabletClientService.java:313)
        at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.startMultiScan(TabletClientService.java:293)
        at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:632)
        ... 13 more
{noformat}

In addition to its normal processing, the master was also writing new mutations for replication,
and those writes started failing as well. The odd part is that the failure reports accumulo.metadata,
but the mutations were destined for the replication table, not accumulo.metadata.

{noformat}
org.apache.accumulo.core.client.MutationsRejectedException: # constraint violations : 0  security codes: {accumulo.metadata(ID:!0)=[DEFAULT_SECURITY_ERROR]}  # server errors 0 # exceptions 0
        at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.checkForFailures(TabletServerBatchWriter.java:537)
        at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.flush(TabletServerBatchWriter.java:331)
        at org.apache.accumulo.core.client.impl.BatchWriterImpl.flush(BatchWriterImpl.java:61)
        at org.apache.accumulo.master.replication.WorkMaker.addWorkRecord(WorkMaker.java:192)
        at org.apache.accumulo.master.replication.WorkMaker.run(WorkMaker.java:124)
        at org.apache.accumulo.master.replication.ReplicationDriver.run(ReplicationDriver.java:91)
{noformat}

While the first set of exceptions eventually stopped, the BatchWriter kept failing for the
rest of the test (which ultimately failed). Both cases are similar (code executed repeatedly
inside the master), but the former creates a new BatchScanner on each pass, whereas the latter
reuses the same BatchWriter.

I wonder whether there's an issue in the BatchWriter that renders it useless after the
underlying tserver dies or goes away. The stack trace above suggests that this is the
case.
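A minimal sketch of the contrast described above, assuming the hypothesis holds (a writer stays unusable once its tserver goes away): the BatchScanner path effectively starts fresh on every pass, while the replication path keeps one writer. One possible mitigation is to discard a writer after a failed flush and build a new one. `MutationWriter`, `SimulatedWriter`, and `RecreatingWriter` are hypothetical stand-ins, not Accumulo's `BatchWriter` API, and this is not how `WorkMaker` is actually implemented.

```java
import java.util.function.Supplier;

/**
 * Hypothetical sketch: discard a writer after a failed flush instead of
 * reusing it. These types are illustrative stand-ins, not Accumulo's API.
 */
public class RecreatingWriterSketch {

  /** Stand-in for a buffered writer that may become unusable after a failure. */
  interface MutationWriter {
    void add(String mutation);

    void flush() throws Exception;
  }

  /** Simulated writer: a "broken" one rejects every flush, like a writer stuck on a dead tserver. */
  static final class SimulatedWriter implements MutationWriter {
    private final boolean broken;
    private int pending = 0;

    SimulatedWriter(boolean broken) {
      this.broken = broken;
    }

    @Override
    public void add(String mutation) {
      pending++; // buffer the mutation until the next flush
    }

    @Override
    public void flush() throws Exception {
      if (broken) {
        throw new Exception("mutations rejected");
      }
      pending = 0; // everything buffered was "written"
    }
  }

  /** Builds writers lazily from a factory and drops the current one after any failure. */
  static final class RecreatingWriter {
    private final Supplier<MutationWriter> factory;
    private MutationWriter current;
    int created = 0; // number of writers built so far, exposed for illustration

    RecreatingWriter(Supplier<MutationWriter> factory) {
      this.factory = factory;
    }

    /** Adds and flushes one mutation; returns false so the caller can retry it. */
    boolean write(String mutation) {
      if (current == null) {
        current = factory.get();
        created++;
      }
      try {
        current.add(mutation);
        current.flush();
        return true;
      } catch (Exception e) {
        current = null; // never reuse a writer that has already failed
        return false;
      }
    }
  }

  public static void main(String[] args) {
    int[] built = {0};
    // The first writer is broken (as if its tserver went away); later ones work.
    RecreatingWriter rw = new RecreatingWriter(() -> new SimulatedWriter(built[0]++ == 0));
    System.out.println(rw.write("m1")); // false: the first writer fails
    System.out.println(rw.write("m2")); // true: a fresh writer was built
    System.out.println(rw.created); // 2
  }
}
```

The design choice is simply to treat any flush failure as poisoning the writer; whether that is too aggressive for transient errors is an open question this sketch does not answer.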


> Unexpected ThriftSecurityException from BatchScanner
> ----------------------------------------------------
>
>                 Key: ACCUMULO-2964
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2964
>             Project: Accumulo
>          Issue Type: Bug
>          Components: client, tserver
>            Reporter: Josh Elser
>            Priority: Minor
>             Fix For: 1.7.0
>
>
> This is something I've only seen a handful of times when writing/running tests that stop
> and restart tservers. When the tserver is restarted, a thread (typically running in the
> master) that is trying to read a table keeps polling until the tserver comes back up.
> Very infrequently, the client gets a {{ThriftSecurityException}} with a code of {{DEFAULT_SECURITY_ERROR}}
> and a message of {{Unknown security exception}}. There is no additional information in the
> client log (from the Thrift call inside the BatchScanner), and the tserver log contains no error
> messages at all.
> The error the client saw:
> {noformat}
> 2014-07-01 04:18:18,971 [impl.TabletServerBatchReaderIterator] DEBUG: Server : host:58090 msg : null
> ThriftSecurityException(user:!SYSTEM, code:null)
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$startMultiScan_result$startMultiScan_resultStandardScheme.read(TabletClientService.java:10045)
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$startMultiScan_result$startMultiScan_resultStandardScheme.read(TabletClientService.java:10022)
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$startMultiScan_result.read(TabletClientService.java:9961)
>         at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_startMultiScan(TabletClientService.java:313)
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.startMultiScan(TabletClientService.java:293)
>         at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:632)
>         at org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:592)
>         at org.apache.accumulo.core.metadata.MetadataLocationObtainer.lookupTablets(MetadataLocationObtainer.java:181)
>         at org.apache.accumulo.core.client.impl.TabletLocatorImpl.processInvalidated(TabletLocatorImpl.java:667)
>         at org.apache.accumulo.core.client.impl.TabletLocatorImpl.binRanges(TabletLocatorImpl.java:337)
>         at org.apache.accumulo.core.client.impl.TabletLocatorImpl.processInvalidated(TabletLocatorImpl.java:660)
>         at org.apache.accumulo.core.client.impl.TabletLocatorImpl._locateTablet(TabletLocatorImpl.java:610)
>         at org.apache.accumulo.core.client.impl.TabletLocatorImpl.locateTablet(TabletLocatorImpl.java:440)
>         at org.apache.accumulo.core.client.impl.ThriftScanner.scan(ThriftScanner.java:226)
>         at org.apache.accumulo.core.client.impl.ScannerIterator$Reader.run(ScannerIterator.java:84)
>         at org.apache.accumulo.core.client.impl.ScannerIterator.hasNext(ScannerIterator.java:177)
>         at org.apache.accumulo.master.replication.DistributedWorkQueueWorkAssigner.createWork(DistributedWorkQueueWorkAssigner.java:161)
>         at org.apache.accumulo.master.replication.DistributedWorkQueueWorkAssigner.assignWork(DistributedWorkQueueWorkAssigner.java:140)
>         at org.apache.accumulo.master.replication.WorkDriver.run(WorkDriver.java:97)
> {noformat}
> The interesting part is that when the client saw this message, the new TabletServer had
> already started, and the old tabletserver appears to have been dead for 20s. So the client
> in the master had been polling for 20s and getting a ConnectException (connection refused),
> which is expected. I don't know why we got this exception only after that length of time.
> How rarely I see this makes me wonder whether the new tabletserver, binding random ports, is
> somehow re-grabbing the old tserver's Thrift client service port, and whether something is
> unexpectedly being interpreted as this ThriftSecurityException. That's the only explanation
> that seems even remotely plausible to me.
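The polling described in the issue (keep retrying while the tserver is down, expecting ConnectException) can be sketched as follows, under stated assumptions: `pollUntilUp` is a hypothetical helper, not Accumulo's actual retry logic in `ThriftScanner` or `TabletServerBatchReaderIterator`. It shows why a single unexpected error is enough to end the poll: connection-refused errors are treated as transient and retried, while any other exception, such as a spurious security exception, propagates on its first occurrence.

```java
import java.net.ConnectException;
import java.util.concurrent.Callable;

/**
 * Hypothetical sketch of "poll until the tserver is back": connection-refused
 * errors are retried, anything else propagates immediately.
 */
public class PollUntilUpSketch {

  static <T> T pollUntilUp(Callable<T> op, int maxAttempts, long sleepMillis) throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    ConnectException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return op.call();
      } catch (ConnectException e) {
        // Connection refused: the old tserver is gone and nothing is listening yet.
        last = e;
        if (attempt < maxAttempts) {
          Thread.sleep(sleepMillis);
        }
      }
      // Any other exception (e.g. a security exception) is not caught here,
      // so one unexpected error ends the poll.
    }
    throw last; // every attempt was refused
  }

  public static void main(String[] args) throws Exception {
    int[] calls = {0};
    // Refuse the first three connections, then succeed.
    String result = pollUntilUp(() -> {
      if (calls[0]++ < 3) {
        throw new ConnectException("Connection refused");
      }
      return "scanned";
    }, 10, 10);
    System.out.println(result + " after " + calls[0] + " attempts"); // scanned after 4 attempts
  }
}
```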



--
This message was sent by Atlassian JIRA
(v6.2#6252)
