Date: Mon, 12 Dec 2011 22:40:24 +0530
Message-ID: 
 <CAM7FQQtYUQ-Exf7+sfG+nFjFwu+7mG3AVTJuRpPO6XV-QvScEw@mail.gmail.com>
Subject: Awesome post on Hadoop. Some questions...
From: prasenjit mukherjee <prasen.bea@gmail.com>
To: common-user <common-user@hadoop.apache.org>
Content-Type: text/plain; charset=ISO-8859-1

I really enjoyed reading this post:
http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/
Great job.

Some related questions:

1. The article says that HDFS always keeps 2 copies in the same
rack and the 3rd in a different rack. As far as I can tell, this only
speeds up the HDFS "put" (file creation) time. Wouldn't it be better to
spread the copies across 3 racks? What other advantages does this 2+1
approach have?
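
To make question 1 concrete, here is a small Python sketch of how I am
comparing the two layouts. It only counts how many hops in the write
pipeline cross a rack boundary and how many distinct racks hold a copy.
The rack names and helper functions are made up for illustration; this
is not HDFS code and the pipeline order is my assumption.

```python
def cross_rack_hops(pipeline_racks):
    """Count hops between consecutive replicas that cross a rack boundary."""
    pairs = zip(pipeline_racks, pipeline_racks[1:])
    return sum(1 for a, b in pairs if a != b)


def distinct_racks(pipeline_racks):
    """Count how many different racks hold at least one replica."""
    return len(set(pipeline_racks))


# Hypothetical layouts: the rack of each replica, in pipeline order.
two_plus_one = ["rack1", "rack2", "rack2"]   # 2 copies share a rack
three_racks = ["rack1", "rack2", "rack3"]    # every copy on its own rack

for name, layout in [("2+1", two_plus_one), ("3 racks", three_racks)]:
    print(name, cross_rack_hops(layout), distinct_racks(layout))
```

Under these assumptions the 2+1 layout crosses racks once during the
write while the 3-rack layout crosses twice, but the 3-rack layout puts
a copy on one more rack.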

2. In HDFS the client reads blocks sequentially. Why can't the client
read the blocks in parallel? Wouldn't that speed up lookups from the
client's perspective?
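
What I have in mind for question 2 is roughly the following Python
sketch. The in-memory BLOCKS table and fetch_block() are hypothetical
stand-ins for reading a block from a data node; they are not part of any
HDFS client API. The point is only that blocks could be fetched
concurrently and still be reassembled in order.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for block storage: block id -> block contents.
BLOCKS = {0: b"aaa", 1: b"bbb", 2: b"ccc"}


def fetch_block(block_id):
    """Pretend to read one block from whichever data node holds it."""
    return BLOCKS[block_id]


def read_sequential(block_ids):
    """Fetch the blocks one after another, in file order."""
    return b"".join(fetch_block(b) for b in block_ids)


def read_parallel(block_ids, workers=3):
    """Fetch the blocks concurrently; map() returns results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(fetch_block, block_ids))


print(read_sequential([0, 1, 2]) == read_parallel([0, 1, 2]))
```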

3. There are some cases in which a Data Node daemon itself will need
to read a block of data from HDFS. When would a data node need to read
from other data nodes? Is it when the split size is larger than the
block size? Even in that case, it is the TaskTracker that should ask
for the data, not the data node.

Thanks,
Prasenjit