From: Bryan Duxbury
Subject: Massive discrepancies in job's bytes written/read
Date: Tue, 17 Mar 2009 17:26:52 -0700
To: core-user@hadoop.apache.org

Hey all,

In looking at the stats for a number of our jobs, the amount of data that the UI claims we've read from or written to HDFS is vastly larger than the amount of data that should be involved in the job. For instance, we have a job that combines small files into big files, operating on around 2TB worth of data. The outputs in HDFS (via hadoop dfs -du) match the expected size, but the jobtracker UI claims that we've read and written around 22TB of data!

By all accounts, Hadoop is actually *doing* the right thing - we're not observing excess data reading or writing anywhere. However, this massive discrepancy makes the job stats essentially worthless for understanding IO in our jobs.

Does anyone know why there's such an enormous difference? Have others experienced this problem?

-Bryan