hadoop-common-user mailing list archives

From Brian Bockelman <bbock...@cse.unl.edu>
Subject Auditing and accounting with Hadoop
Date Wed, 07 Jan 2009 15:03:54 GMT
Hey,

One of our charges is to do auditing and accounting with our file
systems (we use the simplifying assumption that the users are
non-malicious).

Auditing can be done by going through the namenode logs and utilizing  
the UGI information to track opens/reads/writes back to the users.   
Accounting can be done by adding up the byte counts from the datanode  
traces (or via the lovely metrics interfaces).  However, joining them  
together appears to be impossible!  The namenode audits record  
originating IP and UGI; the datanode audits contain the originating IP  
and DFSClient ID.  With 8 clients (and possibly 8 users) opening  
multiple files all from the same IP, it becomes a mess to untangle.
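To make the mismatch concrete, here is a small Java sketch that parses
a pair of made-up log lines and tries to attribute datanode bytes using
the IP as the only shared key.  The line formats and names are
illustrative assumptions, not what any particular Hadoop release
actually emits:

import java.util.*;
import java.util.regex.*;

// Sketch of the problem: namenode audit lines carry (UGI, IP), datanode
// lines carry (IP, DFSClient ID), so the IP is the only common key.
// The log formats below are made up for illustration.
public class JoinOnIpSketch {
    static final Pattern NN_OPEN =
        Pattern.compile("ugi=(\\S+) ip=/(\\S+) cmd=open src=(\\S+)");
    static final Pattern DN_XFER =
        Pattern.compile("src: /(\\S+), cliID: (\\S+), bytes: (\\d+)");

    public static void main(String[] args) {
        String[] nnLines = {
            "ugi=brian ip=/10.1.1.8 cmd=open src=/data/f1",
            "ugi=alice ip=/10.1.1.8 cmd=open src=/data/f2"
        };
        String[] dnLines = {
            "src: /10.1.1.8, cliID: DFSClient_123, bytes: 1048576"
        };

        // Collect every UGI seen opening files from each IP.
        Map<String, Set<String>> ugisByIp = new HashMap<>();
        for (String line : nnLines) {
            Matcher m = NN_OPEN.matcher(line);
            if (m.matches()) {
                ugisByIp.computeIfAbsent(m.group(2), k -> new TreeSet<>())
                        .add(m.group(1));
            }
        }

        // Attribute datanode byte counts: IP alone is ambiguous as soon
        // as more than one user works from the same host.
        for (String line : dnLines) {
            Matcher m = DN_XFER.matcher(line);
            if (m.matches()) {
                System.out.println(m.group(3) + " bytes for " + m.group(2)
                    + " could belong to any of " + ugisByIp.get(m.group(1)));
            }
        }
    }
}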

For example, with other filesystems we've been able to construct a
database with one row representing a file access from open to close.
We record the username, the amount of time the file was open, the
number of bytes read, the remote IP, and the server which served the
file (the previous filesystem stored an entire file on one server, not
blocks).  That model is already problematic here, as several servers
take part in serving the file to the client.  The depressing, horrible
file access pattern of some jobs (worse than random!  To read a single
1MB record in its entirety with a 10MB read buffer, you can end up
reading as much as 2GB) means that recording each individual read is
not practical.
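For concreteness, the row from that old setup looks roughly like this
in Java terms (the field names are just placeholders, not an existing
schema):

// One row per file access, open-to-close, as with the previous filesystem.
public class FileAccessRow {
    String username;      // who opened the file
    long   openMillis;    // how long the file was open
    long   bytesRead;     // bytes read during that open
    String remoteIp;      // client address
    String server;        // server that served the file (whole file, not blocks)
}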

I'd like to record audit records and transfer accounting (at some
level) into the DB.  Does anyone have experience doing this?  It seems
that, if I can add the DFSClient ID to the namenode logs, I can record
the following (a rough sketch of the resulting join is below):
1) Each open of a file at the namenode (but not the corresponding
close), along with the UGI, timestamp, and IP.
2) For each read/write on a datanode: the datanode, remote IP,
DFSClient ID, and bytes written/read (but I miss the overall
transaction time -- though that could possibly be logged).  I wouldn't
record the block ID, since I can't map block ID -> file name cheaply:
I'd have to do it either synchronously, causing a massive performance
hit, or asynchronously, and then trip up over any files deleted after
they were read.
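Roughly, that join would look like the sketch below.  All class and
field names are placeholders, log parsing is omitted, and it assumes
the client ID really is present on both sides:

import java.util.*;

public class AccountingJoinSketch {
    static class OpenRecord {        // one "open" from the namenode log
        String ugi, remoteIp, clientId, path;
        long timestamp;
        OpenRecord(String ugi, String ip, String cid, String path, long ts) {
            this.ugi = ugi; this.remoteIp = ip; this.clientId = cid;
            this.path = path; this.timestamp = ts;
        }
    }
    static class TransferRecord {    // one read/write from a datanode log
        String datanode, remoteIp, clientId;
        long bytes;
        TransferRecord(String dn, String ip, String cid, long bytes) {
            this.datanode = dn; this.remoteIp = ip; this.clientId = cid;
            this.bytes = bytes;
        }
    }

    // Key on (remote IP, DFSClient ID), the pair that would appear in
    // both logs once the namenode side also records the client ID.
    static String key(String ip, String clientId) { return ip + "|" + clientId; }

    static Map<String, Long> bytesPerUgi(List<OpenRecord> opens,
                                         List<TransferRecord> xfers) {
        Map<String, String> ugiByKey = new HashMap<>();
        for (OpenRecord o : opens)
            ugiByKey.put(key(o.remoteIp, o.clientId), o.ugi);

        Map<String, Long> totals = new HashMap<>();
        for (TransferRecord t : xfers) {
            String ugi = ugiByKey.getOrDefault(key(t.remoteIp, t.clientId),
                                               "(unknown)");
            totals.merge(ugi, t.bytes, Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<OpenRecord> opens = Arrays.asList(
            new OpenRecord("brian", "10.1.1.8", "DFSClient_123", "/data/f1", 0L),
            new OpenRecord("alice", "10.1.1.8", "DFSClient_456", "/data/f2", 0L));
        List<TransferRecord> xfers = Arrays.asList(
            new TransferRecord("dn7", "10.1.1.8", "DFSClient_123", 1048576L),
            new TransferRecord("dn3", "10.1.1.8", "DFSClient_456", 524288L));
        System.out.println(bytesPerUgi(opens, xfers));
    }
}

Bytes then attribute to a UGI even when several users share one host.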

This would allow me to see who is accessing which files and how much
each client is reading -- but not necessarily which files they read
from, if the same client ID is used for multiple files.  It would also
let me trace reads back to specific users (so I can tell who has the
worst access patterns and beat them).

So, my questions are:
a) Is anyone doing anything remotely similar which I can reuse?
b) Is there some hole in my logic which would render the approach  
useless?
c) Is my approach reasonable?  I.e., should I really be looking at
inserting hooks into the DFSClient, since that's the only component
which can tell me things like when the client closed the file?

Advice is welcome.

Brian
