From: Doug Cutting
Date: Wed, 07 Jan 2009 10:16:02 -0800
To: core-user@hadoop.apache.org
Subject: Re: Auditing and accounting with Hadoop
Message-ID: <4964F162.5030902@apache.org>
In-Reply-To: <52A55BBC-4052-4EFA-AE39-BF9AF57B1FDB@cse.unl.edu>

The notion of a client/task ID, independent of IP or username, seems useful for log analysis. DFS's client ID is probably currently your best bet, but we might improve its implementation and make the notion more generic. It is currently implemented as:

   String taskId = conf.get("mapred.task.id");
   if (taskId != null) {
     this.clientName = "DFSClient_" + taskId;
   } else {
     this.clientName = "DFSClient_" + r.nextInt();
   }

This hardwires a mapred dependency, which is fragile, and it's fairly useless outside of mapreduce, degenerating to a random number. Rather, we should probably have a configuration property that's explicitly used to indicate the user-level task, distinct from the username, IP, etc. For MapReduce jobs this could default to the job's ID, but applications might override it.
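One way the client-name logic could drop the hardwired mapred dependency is to consult a generic, application-settable property first and keep the mapred key only as a fallback. A minimal sketch follows; the property name "hadoop.task.id" and the fallback order are assumptions for illustration, not the actual implementation, and a plain Map stands in for Hadoop's Configuration:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class ClientNameSketch {

    /**
     * Derive the DFS client name. A generic task-id property (name assumed)
     * takes precedence; the mapred-specific key is a fallback, and a random
     * number is the last resort, as in the current code.
     */
    static String deriveClientName(Map<String, String> conf, Random r) {
        String taskId = conf.get("hadoop.task.id");   // generic, app-settable (assumed name)
        if (taskId == null) {
            taskId = conf.get("mapred.task.id");      // legacy mapred default
        }
        if (taskId != null) {
            return "DFSClient_" + taskId;
        }
        return "DFSClient_" + r.nextInt();            // degenerate case: random id
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("mapred.task.id", "attempt_200901071016_0001_m_000000_0");
        // Only the mapred key is set, so it is used:
        System.out.println(deriveClientName(conf, new Random()));

        conf.put("hadoop.task.id", "etl-job-42");     // application override wins
        System.out.println(deriveClientName(conf, new Random()));
    }
}
```

With such a property, any client (not just MapReduce tasks) could tag its DFS traffic with a stable, meaningful identifier.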
So perhaps we could add static methods FileSystem.{get,set}TaskId(Configuration), then change the logging code to use this?

What do others think?

Doug

Brian Bockelman wrote:
> Hey,
>
> One of our charges is to do auditing and accounting with our file
> systems (we use the simplifying assumption that the users are
> non-malicious).
>
> Auditing can be done by going through the namenode logs and utilizing
> the UGI information to track opens/reads/writes back to the users.
> Accounting can be done by adding up the byte counts from the datanode
> traces (or via the lovely metrics interfaces). However, joining them
> together appears to be impossible! The namenode audits record the
> originating IP and UGI; the datanode audits contain the originating IP
> and DFSClient ID. With 8 clients (and possibly 8 users) opening
> multiple files, all from the same IP, it becomes a mess to untangle.
>
> For example, in other filesystems, we've been able to construct a
> database with one row representing a file access from open to close. We
> record the username, the amount of time the file was open, the number
> of bytes read, the remote IP, and the server which served the file (the
> previous filesystem kept an entire file on one server, not blocks). That
> model quickly becomes problematic here, as several servers take part in
> serving the file to the client. The depressing, horrible file-access
> pattern of some jobs (worse than random! To read a 1MB record entirely
> with a read-buffer size of 10MB, you can possibly read up to 2GB) means
> that recording each read is not practical.
>
> I'd like to record audit records and transfer accounting (at some
> level) into the DB. Does anyone have any experience in doing this?
> It seems that, if I can add the DFSClient ID into the namenode logs, I
> can record:
> 1) Each open (but miss the corresponding close) of a file at the
> namenode, along with the UGI, timestamp, and IP.
> 2) Each read/write on a datanode: the datanode, remote IP, DFSClient
> ID, and bytes written/read (but I miss the overall transaction time!
> It could possibly be logged). I don't record the block ID, as I can't
> map block ID -> file name in a cheap/easy manner (I'd have to either do
> this synchronously, causing a massive performance hit, or do it
> asynchronously and trip up over any files which were deleted after they
> were read).
>
> This would allow me to see who is accessing what files, and how much
> each client is reading - but not necessarily which files they read
> from, if the same client ID is used for multiple files. This also
> allows me to trace reads back to specific users (so I can tell who has
> the worst access patterns and beat them).
>
> So, my questions are:
> a) Is anyone doing anything remotely similar which I can reuse?
> b) Is there some hole in my logic which would render the approach
> useless?
> c) Is my approach reasonable? I.e., should I really be looking at
> inserting hooks into the DFSClient, as that's the only thing which can
> tell me information like "when did the client close the file?"
>
> Advice is welcome.
>
> Brian
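If the namenode audit log did carry the DFSClient ID as proposed, the join Brian describes reduces to keying both log streams on that ID: the namenode side maps client ID to UGI, and the datanode side supplies byte counts per client ID. A minimal sketch of that aggregation follows; the record fields and shapes are assumptions for illustration, not actual Hadoop log formats:

```java
import java.util.HashMap;
import java.util.Map;

public class AuditJoinSketch {

    // Namenode audit record: assumed to include the DFSClient ID alongside the UGI.
    record Open(String clientId, String ugi, String path) {}

    // Datanode transfer record: bytes moved by a given client.
    record Transfer(String clientId, long bytes) {}

    /** Sum bytes per UGI by joining transfers to opens on the client ID. */
    static Map<String, Long> bytesPerUser(Open[] opens, Transfer[] transfers) {
        // Build the clientId -> user mapping from the namenode audit stream.
        Map<String, String> clientToUser = new HashMap<>();
        for (Open o : opens) {
            clientToUser.put(o.clientId(), o.ugi());
        }
        // Attribute each datanode transfer to a user via the client ID.
        Map<String, Long> totals = new HashMap<>();
        for (Transfer t : transfers) {
            String user = clientToUser.getOrDefault(t.clientId(), "unknown");
            totals.merge(user, t.bytes(), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        Open[] opens = {
            new Open("DFSClient_job_1", "brian", "/data/a"),
            new Open("DFSClient_job_2", "doug", "/data/b"),
        };
        Transfer[] transfers = {
            new Transfer("DFSClient_job_1", 1_000_000L),
            new Transfer("DFSClient_job_1", 500_000L),
            new Transfer("DFSClient_job_2", 42L),
        };
        System.out.println(bytesPerUser(opens, transfers));
    }
}
```

Note this only recovers bytes per user, which matches the accounting goal; as Brian points out, if one client ID covers opens of several files, per-file attribution is still ambiguous.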