hadoop-hdfs-user mailing list archives

From: Kai Voigt <...@123.org>
Subject: Re: ALL HDFS Blocks on the Same Machine if Replication factor = 1
Date: Mon, 10 Jun 2013 13:47:38 GMT

Am 10.06.2013 um 15:36 schrieb Razen Al Harbi <razen.alharbi@gmail.com>:

> I have deployed Hadoop on a cluster of 20 machines. I set the replication factor to one.
> When I put a file (larger than the HDFS block size) into HDFS, all the blocks are stored
> on the machine where the Hadoop put command is invoked.
> For a higher replication factor, I see the same behavior, but the replicated blocks are
> stored randomly on all the other machines.
> Is this normal behavior? If not, what would be the cause?

Yes, this is normal behavior. When an HDFS client happens to run on a host that is also a
DataNode (which is always the case when a reducer writes its output), the first copy of a
block is stored on that very node. This optimizes write latency: writing to the local disk
is faster than writing across the network.
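You can check this on your own cluster with fsck. A minimal sketch (the file name and
target path are made up; -D overrides dfs.replication for this one command only):

    # Write a file with replication factor 1 from a DataNode host.
    hdfs dfs -D dfs.replication=1 -put bigfile /tmp/bigfile

    # Print every block of the file and the DataNode(s) holding it; with
    # replication 1 and a client running on a DataNode, each block should
    # report that same local node.
    hdfs fsck /tmp/bigfile -files -blocks -locations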

The second copy of the block is stored on a random host in another rack (if your cluster
is configured to be rack-aware; see the sketch at the end of this message), to improve the
distribution of the data.

The third copy of the block is stored on a different random host in that same remote rack.

So your observations are correct.
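Note that rack awareness is not on by default: without it, HDFS treats the whole cluster
as a single rack (/default-rack) and spreads the copies over random nodes. A minimal
sketch of a topology script, assuming a hypothetical path /etc/hadoop/topology.sh and
made-up subnets; you point Hadoop at it via net.topology.script.file.name in
core-site.xml (topology.script.file.name on older 1.x releases):

    #!/bin/bash
    # Hadoop passes one or more DataNode IPs/hostnames as arguments and
    # expects exactly one rack path per argument on stdout, in order.
    for host in "$@"; do
      case "$host" in
        10.0.1.*) echo "/rack1" ;;
        10.0.2.*) echo "/rack2" ;;
        *)        echo "/default-rack" ;;
      esac
    done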


Kai Voigt
