accumulo-notifications mailing list archives

From "Eric Newton (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-3880) Malformed Configuration Causes tservers To Shutdown
Date Tue, 02 Jun 2015 17:23:49 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569457#comment-14569457 ]

Eric Newton commented on ACCUMULO-3880:
---------------------------------------

Oh, right... it's coming back to me.

The basic idea is: a dead server has been brought back to life by some autonomous system.
If it doesn't have the right "secret" it should not join the team.

# I don't know that this has been a real problem. It's the kinda thing that *is* a problem, and is a result of general paranoia about half-capable services.
# This is really old code, so if we merged a cluster or something, it would have failed fast, and nobody would have documented an issue.
# There should be a better way to figure out "I'm not in the right cluster".  Some basic check of the cluster id, for example.

I don't know that the original problem is really an issue, but advanced re-structuring of
resources definitely is a problem. But tablet server locks are held under the instance id,
so we automatically have some guarantee of "in the right instance."
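
A minimal sketch of the kind of cluster id check suggested above, assuming the instance id can be compared as an opaque string. The names `InstanceIdCheck`, `Decision`, and `check` are hypothetical and are not Accumulo APIs. The idea is that a peer with the wrong id has its request rejected, while the receiving server keeps running instead of halting:

```java
// Hypothetical illustration only; not taken from the Accumulo code base.
class InstanceIdCheck {

  /** Outcome of comparing a peer's claimed instance id with our own. */
  enum Decision { ACCEPT, REJECT_REQUEST }

  /**
   * Compares the instance id this server belongs to with the id a peer claims.
   * A missing or different id rejects the single request; it never asks the
   * receiving server to shut itself down.
   */
  static Decision check(String localInstanceId, String claimedInstanceId) {
    if (localInstanceId == null) {
      throw new IllegalArgumentException("local instance id must be set");
    }
    if (claimedInstanceId == null || !localInstanceId.equals(claimedInstanceId)) {
      return Decision.REJECT_REQUEST;
    }
    return Decision.ACCEPT;
  }

  public static void main(String[] args) {
    // Placeholder ids; real ids would come from the cluster's own metadata.
    String local = "instance-a";
    System.out.println(check(local, "instance-a"));
    System.out.println(check(local, "instance-b"));
    System.out.println(check(local, null));
  }
}
```

Returning a decision instead of halting inside the check leaves the caller to choose the response, so a misconfigured peer cannot take down a healthy server.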

> Malformed Configuration Causes tservers To Shutdown
> ---------------------------------------------------
>
>                 Key: ACCUMULO-3880
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3880
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.7.0
>         Environment: HDP 2.2.7.0 to HDP 2.3.0.0 Upgrade
>            Reporter: Jonathan Hurley
>            Assignee: Josh Elser
>            Priority: Critical
>             Fix For: 1.6.3, 1.7.0, 1.8.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> During a rolling upgrade from HDP 2.2 to HDP 2.3, the Accumulo tracer fails to start because it is unable to find any tabletservers. The tabletservers were updated to HDP 2.3 earlier in the upgrade process and did come online briefly.
> The PID file still exists, but the tservers are definitely down:
> {noformat}
> [root@c6401 accumulo]# cat accumulo-accumulo-tserver.pid
> 6075
> [root@c6401 accumulo]# ps -a | grep 6075
> {noformat}
> It seems like the problem might be located in the following piece of code:
> {code}
>     private void checkPermission(TCredentials credentials, String lock, final String request) throws ThriftSecurityException {
>       boolean fatal = false;
>       try {
>         log.trace("Got " + request + " message from user: " + credentials.getPrincipal());
>         if (!security.canPerformSystemActions(credentials)) {
>           log.warn("Got " + request + " message from user: " + credentials.getPrincipal());
>           throw new ThriftSecurityException(credentials.getPrincipal(), SecurityErrorCode.PERMISSION_DENIED);
>         }
>       } catch (ThriftSecurityException e) {
>         log.warn("Got " + request + " message from unauthenticatable user: " + e.getUser());
>         if (getCredentials().getToken().getClass().getName().equals(credentials.getTokenClassName())) {
>           log.error("Got message from a service with a mismatched configuration. Please ensure a compatible configuration.", e);
>           fatal = true;
>         }
>         throw e;
>       } finally {
>         if (fatal) {
>           Halt.halt(1, new Runnable() {
>             @Override
>             public void run() {
>               gcLogger.logGCInfo(TabletServer.this.getConfiguration());
>             }
>           });
>         }
>       }
>     }
> {code}
> Where a malformed principal causes a {{Halt}}.
> From the tserver logs:
> {noformat}
> 2015-06-01 19:25:30,462 [rpc.TServerUtils] DEBUG: Instantiating default, unsecure custom half-async Thrift server
> 2015-06-01 19:25:30,468 [tserver.TabletServer] INFO : address = c6401.ambari.apache.org:9997
> 2015-06-01 19:25:30,510 [tserver.TabletServer] INFO : Waiting for tablet server lock
> {noformat}
> There is also no content in the *.out or *.err files for tserver.
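
The stale PID file check in the quoted report can also be sketched in Java, assuming Java 11 or later. The class `PidFileCheck` and its helpers are hypothetical names for illustration. Unlike `ps -a`, which lists only processes attached to a terminal, `ProcessHandle.of` looks the pid up directly, so a daemonized process is not missed:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; not part of Accumulo.
class PidFileCheck {

  /** Parses the contents of a pid file (a number plus optional whitespace). */
  static long parsePid(String contents) {
    return Long.parseLong(contents.trim());
  }

  /** Returns true if a process with this pid exists and is alive. */
  static boolean isAlive(long pid) {
    return ProcessHandle.of(pid).map(ProcessHandle::isAlive).orElse(false);
  }

  public static void main(String[] args) throws IOException {
    // The default file name is the one shown in the report; pass another path as args[0].
    Path pidFile = Path.of(args.length > 0 ? args[0] : "accumulo-accumulo-tserver.pid");
    long pid = parsePid(Files.readString(pidFile));
    System.out.println(pid + (isAlive(pid) ? " is running" : " is not running (stale pid file)"));
  }
}
```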



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
