Mailing-List: contact core-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: core-user@hadoop.apache.org
Message-Id: <C56691C2-9757-46A0-8BBB-8C54AC0056A8@rapleaf.com>
From: Bryan Duxbury <bryan@rapleaf.com>
Subject: Massive discrepancies in job's bytes written/read
Date: Tue, 17 Mar 2009 17:26:52 -0700
To: core-user@hadoop.apache.org

Hey all,

In looking at the stats for a number of our jobs, the amount of data  
that the UI claims we've read from or written to HDFS is vastly  
larger than the amount of data that should be involved in the job.  
For instance, we have a job that combines small files into big files,  
operating on around 2TB worth of data. The output size in HDFS  
(via hadoop dfs -du) matches what we expect, but the jobtracker UI  
claims that we've read and written around 22TB of data!
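
To put a number on the gap, here is a minimal sketch of the arithmetic. The helper name inflation_factor and the binary-terabyte constant are assumptions made for illustration; nothing here comes from Hadoop itself, and the byte counts are just the rough figures quoted above.

```python
# Assumption: "TB" is treated as a binary terabyte (1024^4 bytes).
TB = 1024 ** 4


def inflation_factor(reported_bytes, actual_bytes):
    """Return how many times larger the reported figure is than the measured one.

    Hypothetical helper for illustration only.
    """
    if actual_bytes <= 0:
        raise ValueError("actual_bytes must be positive")
    return reported_bytes / actual_bytes


# Rough figures from this job: ~2TB on HDFS per `hadoop dfs -du`,
# ~22TB read/written according to the jobtracker UI.
factor = inflation_factor(22 * TB, 2 * TB)
print(f"UI reports {factor:.1f}x the data actually on HDFS")
```

With the figures above, the UI is overstating the job's HDFS traffic by roughly an order of magnitude.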

By all accounts, Hadoop is actually *doing* the right thing - we're  
not observing excess data reading or writing anywhere. However, this  
massive discrepancy makes the job stats essentially worthless for  
understanding IO in our jobs.

Does anyone know why there's such an enormous difference? Have others  
experienced this problem?

-Bryan