hadoop-common-user mailing list archives

From Raghu Angadi <rang...@yahoo-inc.com>
Subject Re: performance questions
Date Sat, 09 Jun 2007 02:08:44 GMT

Your interest is good. I think you should ask an even smaller number of
questions in one mail and do more experimentation yourself.

Bwolen Yang wrote:
> Here is a summary of my remaining questions from the [write and sort
> performance] thread.
> 
> - It looks like every 5GB of data I put into Hadoop DFS uses up ~18GB
> of raw disk space (based on block counts exported from the namenode).
> Accounting for 3x replication, I was expecting 15GB. What's causing
> this 20% overhead?

You are assuming each block is 64M. A file's last block is usually 
smaller than that, so estimating space from block counts overcounts. 
There are also some extra blocks for the hidden "CRC files" that hold 
checksums. Did you try running du on the datanode's 'data directories'?

> - When a large amount of data is written to HDFS (for example via
> copyFromLocal), is the block replication pipelined?  Also, does
> one 64MB block need to be fully replicated before the next 64MB copy
> can start?

They are pipelined. Again, you can experiment: try writing with a 
single replica (set in the config) and see if the copy runs much 
faster. If it does not, that confirms the replicas are written in a 
pipeline rather than one after another.
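A minimal sketch of that experiment (file names here are placeholders; 
dfs.replication is the standard knob in conf/hadoop-site.xml):

   # in conf/hadoop-site.xml, set before loading the data:
   #   <property><name>dfs.replication</name><value>1</value></property>

   # time the same copy with 1 replica, then with the default 3
   time bin/hadoop dfs -copyFromLocal /local/bigfile /user/bwolen/bigfile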

Raghu.
