hadoop-hdfs-issues mailing list archives

From "Allen Wittenauer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-9184) Logging HDFS operation's caller context into audit logs
Date Fri, 02 Oct 2015 19:25:29 GMT

    [ https://issues.apache.org/jira/browse/HDFS-9184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14941613#comment-14941613 ]

Allen Wittenauer commented on HDFS-9184:

bq. Is it documented anywhere that the audit log is key/value? I didn't see any specification
for the format...

It's a) not documented and b) not a kvp.

Story time. This is going to be the shorter version.  

I have few regrets about things I helped design in Hadoop, but this does happen to be one
of them, especially given all of the misunderstanding around what its purpose in life is
and how people actually use it.  When [~chris.douglas] and I did the design work on the audit
log back in 2008 (IIRC), I specifically wanted a fixed-field log file format.  We were going
to be writing ops tools to answer questions that we, the ops team, simply could not answer
otherwise. It was important that the format stay fixed for a variety of reasons:

* The ops team at Y! was tiny with a mix of junior and senior folks. The junior folks were
likely going to be the ones writing the code since the senior folks were busy dealing with
the continual fallout from the weekly Hadoop upgrades and just getting a working infrastructure
in place while we moved away from YST.  (... and getting ops-specific tooling out of dev was
regularly blocked by management ...)

* We needed to make sure that no matter what the devs added to Hadoop, the log file wouldn't
change.  At that point in time, the logs for things like the NN were wildly fluctuating and
were pretty much impossible to use for any sort of metrics or monitoring.  We needed a safe
space away from the turmoil happening in the rest of the system.  If the format had been
open-ended, it would have been absolute hell to work with.  Forcing a format that at that
point covered 100% of the foreseeable use cases solved that problem.

* The content was modeled after Solaris BSM with a few key differences.  BSM wrote in binary,
which just wasn't a real option without us pulling out more advanced techniques. It would
fail the 'quick and dirty' tests that the ops team had to have in order to fulfill user needs.
BSM also supported a heck of a lot more than Hadoop did.  So a straight logfile it was.

Now, one of the things I wanted to avoid was the "tab problem": fields that are empty
end up looking like field<tab><tab>field. So we settled on a <column label>=<value>
format where every label would always be present, so that we could then use spaces to break
up the columns.  [This is why I say it is *not* kvp.  In most key-value stores that I've worked
with, it's rare to see key=(null).]
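
To make the format concrete, here's a minimal parsing sketch.  The field names in the sample
line follow the NN audit log, but the values are invented, and the code assumes values contain
no whitespace (a simplification; e.g. a ugi with an auth suffix would need smarter splitting):

{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

public class AuditLineParser {
  // Because every label is always present (empty values are written as
  // an explicit "null"), splitting on whitespace is unambiguous: there
  // is no field<tab><tab>field case to special-case.
  public static Map<String, String> parse(String line) {
    Map<String, String> fields = new LinkedHashMap<>();
    for (String token : line.trim().split("\\s+")) {
      int eq = token.indexOf('=');
      if (eq > 0) {
        fields.put(token.substring(0, eq), token.substring(eq + 1));
      }
    }
    return fields;
  }

  public static void main(String[] args) {
    // Sample values are made up for illustration.
    String line = "allowed=true ugi=alice ip=/10.0.0.1 cmd=delete "
        + "src=/tmp/foo dst=null perm=null";
    System.out.println(parse(line));
    // {allowed=true, ugi=alice, ip=/10.0.0.1, cmd=delete, src=/tmp/foo,
    //  dst=null, perm=null}
  }
}
{code}

Note that dst=null is a literal placeholder, not a missing field; that's what keeps the column
count fixed and the split unambiguous.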

I've also heard that the file is a "weird form of JSON".  No, it's not.  In fact, I vetoed
JSON because of the extra parsing overhead, with very little gain to be had over just fixing
all the fields.

Now, what would I do differently?  #1 would be documentation with a clear explanation of this
history, covering the whys and the hows.  #2 would probably be to make it officially key-value
with some fields being required.  But that's a different problem altogether....

> Logging HDFS operation's caller context into audit logs
> -------------------------------------------------------
>                 Key: HDFS-9184
>                 URL: https://issues.apache.org/jira/browse/HDFS-9184
>             Project: Hadoop HDFS
>          Issue Type: Task
>            Reporter: Mingliang Liu
>            Assignee: Mingliang Liu
>         Attachments: HDFS-9184.000.patch
> For a given HDFS operation (e.g. delete file), it's very helpful to track which upper-level
job issued it. The upper-level callers may be specific Oozie tasks, MR jobs, and Hive
queries. One scenario is that the namenode (NN) is abused/spammed; the operator may want to
know immediately which MR job should be blamed so that she can kill it. To this end, the caller
context contains at least the application-dependent "tracking id".
> There are several existing techniques that may be related to this problem.
> 1. Currently the HDFS audit log tracks the user of the operation, which is obviously
not enough. It's common for the same user to issue multiple jobs at the same time. Even for
a single top-level task, tracking back to a specific caller in a chain of operations of the
whole workflow (e.g. Oozie -> Hive -> Yarn) is hard, if not impossible.
> 2. HDFS integrated {{htrace}} support for providing tracing information across multiple
layers. Spans are created in many places, interconnected like a tree structure, which relies
on offline analysis across RPC boundaries. For this use case, {{htrace}} has to be enabled at
a 100% sampling rate, which introduces significant overhead. Moreover, passing additional
information (via annotations) other than the span id from the root of the tree to a leaf is a
significant additional overhead.
> 3. In [HDFS-4680 | https://issues.apache.org/jira/browse/HDFS-4680], there is some related
discussion on this topic. The final patch implemented the tracking id as a part of the delegation
token. This protects the tracking information from being changed or impersonated. However,
Kerberos-authenticated connections and insecure connections don't have tokens. [HADOOP-8779]
proposes to use tokens in all scenarios, but that might mean changes to several upstream
projects and a major change in their security implementation.
> We propose another approach to address this problem. We also treat the HDFS audit log as
a good place for after-the-fact root cause analysis. We propose to put the caller id (e.g.
the Hive query id) in thread-locals. Specifically, on the client side the thread-local object
is passed to the NN as a part of the RPC header (optional), while on the server side the NN
retrieves it from the header and puts it into the {{Handler}}'s thread-locals. Finally, in
{{FSNamesystem}}, the HDFS audit logger will record the caller context for each operation. In
this way, the existing code is not affected. (See the sketch after this description.)
> It is still challenging to keep a "lying" client from abusing the caller context. Our proposal
is to add a {{signature}} field to the caller context. The client may choose to provide its
signature along with the caller id. The operator may need to validate the signature at the time
of offline analysis. The NN is not responsible for validating the signature online.
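
For illustration, a minimal sketch of the thread-local pattern the description proposes. The
class shape and every name below are assumptions made for the sketch, not the API committed by
any patch on this issue:

{code:java}
// Illustrative only: names and shape are assumptions, not the patch's API.
public final class CallerContext {
  // One context per handler thread: set from the RPC header on the
  // server side, read later by the audit logger.
  private static final ThreadLocal<CallerContext> CURRENT =
      new ThreadLocal<>();

  private final String context;    // e.g. a Hive query id / tracking id
  private final byte[] signature;  // optional; validated offline, not by the NN

  public CallerContext(String context, byte[] signature) {
    this.context = context;
    this.signature = signature;
  }

  public String getContext() { return context; }
  public byte[] getSignature() { return signature; }

  public static void setCurrent(CallerContext ctx) { CURRENT.set(ctx); }
  public static CallerContext getCurrent() { return CURRENT.get(); }
  public static void clear() { CURRENT.remove(); }
}
{code}

In this sketch, the client attaches the context to the optional RPC header field; the server's
handler thread calls setCurrent() after decoding the header, the audit logger appends
getCurrent().getContext() to the log line, and clear() runs when the handler finishes so
contexts never leak across requests.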

This message was sent by Atlassian JIRA
