hadoop-mapreduce-user mailing list archives

From Friso van Vollenhoven <fvanvollenho...@xebia.com>
Subject Auditable Hadoop
Date Fri, 28 Oct 2011 10:12:32 GMT
Hi all,

I have an auditing challenge. I am looking for a fairly detailed audit trail on MR
jobs. I know that HDFS has an audit log, which you can route to a separate file through log4j
config. But what I ideally need is something that lets me determine, with certainty, which
jobs were run against what data and by whom. By 'which jobs', I mean the source code, not
just the binary.
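For reference, routing the HDFS audit log to its own file is just a log4j.properties change along these lines (the appender name and file path here are my own choices, not anything standard):

```properties
# Send the NameNode audit logger to a dedicated daily-rolling file
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=INFO,DRFAAUDIT
log4j.additivity.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=false
log4j.appender.DRFAAUDIT=org.apache.log4j.DailyRollingFileAppender
log4j.appender.DRFAAUDIT.File=/var/log/hadoop/hdfs-audit.log
log4j.appender.DRFAAUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.DRFAAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n
```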

One idea I had for basic Java based MR jobs is to have devs (me) check the job's code into
some SCM and then have a tool that does a checkout of a particular branch on a machine that
has cluster access and is not open to the dev team. The tool would do a checkout and a build
and write the SCM tag + a hash of the binary (.jar) to an audit trail. Then I'd need some kind
of hook in the job tracker that checks whether a submitted job's binary has been properly audited.
This way, you could always find the source that was executed.
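The build-side half of that tool is straightforward; a minimal sketch (the tab-separated record format and file locations are just my assumptions):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/**
 * Sketch of the build-side audit step: hash the freshly built jar and
 * append "scmTag <TAB> sha256" to an audit trail file.
 */
public class JarAuditor {

    /** Hex-encoded SHA-256 of a file's contents. */
    static String sha256Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(Files.readAllBytes(file));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    /** Append one audit record: SCM tag plus the jar's hash. */
    static void recordAudit(Path auditTrail, String scmTag, Path jar)
            throws IOException, NoSuchAlgorithmException {
        String line = scmTag + "\t" + sha256Hex(jar) + "\n";
        Files.write(auditTrail, line.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```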

Question is: is there a possibility for such a hook in the JT? Or do I need to patch it? It
would be nice to have the auditing happen in the JT, so that the dev team can have regular
access to the cluster (and thus use the hadoop command line tool to copy/move files, etc.)
and the JT would just reject jobs that have not been audited.
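Whatever form the hook takes, the check itself boils down to: recompute the submitted jar's hash and look it up in the audit trail. A sketch of that lookup (this is plain Java against the trail file from the build step, not any existing JT API; wiring it into submission would need the patch I'm asking about):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/**
 * Hypothetical submission-time check: a job's jar is acceptable only if
 * its SHA-256 appears in the audit trail written at build time.
 */
public class AuditGate {

    /** Hex-encoded SHA-256 of a file's contents. */
    static String sha256Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(Files.readAllBytes(file));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    /** True if the submitted jar's hash matches any audited record. */
    static boolean isAudited(Path auditTrail, Path submittedJar)
            throws IOException, NoSuchAlgorithmException {
        String hash = sha256Hex(submittedJar);
        for (String line : Files.readAllLines(auditTrail, StandardCharsets.UTF_8)) {
            // Each record is "scmTag <TAB> sha256" (format assumed above).
            String[] parts = line.split("\t");
            if (parts.length == 2 && parts[1].equals(hash)) {
                return true;
            }
        }
        return false;  // unaudited jar: reject the job
    }
}
```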

Also, with this model, non-Java based jobs are a problem. I probably won't be using streaming,
but Pig, Hive and Mahout will likely be used. For these I'd need some additional steps: confirming
that the Pig / Hive / Mahout binaries which are submitted are trusted ones, and having Pig/Hive/Mahout
add some config params or other info about the script / query that is being executed to the
job configuration.

Does anyone have any ideas on this? Or relevant experiences?

