hadoop-yarn-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4721) RM to try to auth with HDFS on startup, retry with max diagnostics on failure
Date Tue, 08 Mar 2016 10:36:40 GMT

    [ https://issues.apache.org/jira/browse/YARN-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184752#comment-15184752 ]

Steve Loughran commented on YARN-4721:

I'm trying to say that the test is "if this cluster has a default filesystem, it can be listed":

not HDFS specifically, just that fs.default.name is not empty. We could even make the check
conditional on the cluster having a default FS. But if you do have a default FS, YARN had better
be able to talk to it. I'll change the title to make clear that this is broader than just HDFS.
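The proposed check can be sketched in plain Java. This is a minimal illustration, not the patch itself: `DefaultFsProbe`, `FsLister`, and `probe` are hypothetical names, and the `FsLister` interface stands in for Hadoop's `FileSystem.listStatus(new Path("/"))`.

```java
import java.util.Map;

public class DefaultFsProbe {

    /** Hypothetical stand-in for FileSystem.listStatus(new Path("/")). */
    interface FsLister {
        boolean listRoot(String fsUri);
    }

    /**
     * Sketch of the check: skip it if no default FS is configured;
     * otherwise the listing must succeed.
     * Returns null if the check passes (or there is nothing to check),
     * else a diagnostic message suitable for failing fast.
     */
    static String probe(Map<String, String> conf, FsLister lister) {
        String fsUri = conf.getOrDefault("fs.defaultFS", "").trim();
        if (fsUri.isEmpty()) {
            return null;  // no default FS configured: nothing to verify
        }
        if (!lister.listRoot(fsUri)) {
            return "cannot list root of default filesystem " + fsUri
                + "; check the RM principal's credentials";
        }
        return null;
    }
}
```

The point of returning a diagnostic string rather than throwing is that the caller can decide whether to fail fast or log and retry.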

> One could argue for a stand-alone service (outside of YARN) that does these validations.

That doesn't address the problem I'm looking at, which is: validate that a specific process
started under a specific principal on a specific host has the credentials needed to access
a critical part of the cluster infrastructure, the default FS.

> So, the notion of "this cluster cannot talk to my HDFS" doesn't generalize. It is context
dependent and almost always "my app cannot talk to this and that HDFS instance".

I agree, which is why distcp will need special attention. However, YARN does have a specific
notion of the defaultFS; ATS 1.5 ramps this up by only working with an FS whose
flush() makes data durable and visible to others (though it doesn't require
the metadata to be complete/visible).

It's the authentication of that YARN process to the cluster FS (or, more specifically,
identifying why it sometimes doesn't happen) that I'm trying to look at.

Anyway, this initial patch doesn't attempt any of that: it checks UGI.isSecurityEnabled and,
if security is enabled, runs some extra diagnostics and fails fast on a few conditions
guaranteed to stop Hadoop working. Do you have any issues with that part of the patch?
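The fail-fast shape described above can be sketched as follows. The class name, the specific conditions checked, and the config keys are all illustrative assumptions; the real patch queries UserGroupInformation and the live Kerberos configuration rather than a plain map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class KerberosPreflight {

    /**
     * Sketch of startup diagnostics: on an insecure cluster do nothing;
     * on a secure cluster, collect conditions guaranteed to stop Hadoop
     * working so the process can fail fast with all of them reported.
     * Returns the list of fatal problems found; empty means OK.
     */
    static List<String> check(boolean securityEnabled, Map<String, String> conf) {
        List<String> problems = new ArrayList<>();
        if (!securityEnabled) {
            return problems;  // insecure cluster: nothing to verify
        }
        // Illustrative checks only; the real set lives in the patch.
        if (conf.getOrDefault("java.security.krb5.realm", "").isEmpty()) {
            problems.add("no Kerberos realm configured");
        }
        if (conf.getOrDefault("keytab.path", "").isEmpty()) {
            problems.add("no keytab configured for the RM principal");
        }
        return problems;
    }
}
```

Collecting every problem before failing, instead of stopping at the first one, is what gives the "max diagnostics on failure" behaviour the issue title asks for.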

> RM to try to auth with HDFS on startup, retry with max diagnostics on failure
> -----------------------------------------------------------------------------
>                 Key: YARN-4721
>                 URL: https://issues.apache.org/jira/browse/YARN-4721
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>         Attachments: HADOOP-12889-001.patch
> If the RM can't auth with HDFS, this can first surface during job submission, which can
cause confusion about what's wrong and whose credentials are playing up.
> Instead, the RM could try to talk to HDFS on launch, {{ls /}} should suffice. If it can't
auth, it can then tell UGI to log more and retry.
> I don't know what the policy should be if the RM can't auth to HDFS at this point. Certainly
it can't currently accept work. But should it fail fast or keep going in the hope that the
problem is in the KDC or NN and will fix itself without an RM restart?
