hadoop-user mailing list archives

From java8964 <java8...@hotmail.com>
Subject RE: HDFS File Writes & Reads
Date Thu, 19 Jun 2014 13:35:25 GMT
Your understanding is almost correct, but not the part you highlighted.
HDFS is not designed for write performance, but the client doesn't have to wait for the
acknowledgment of previous packets before sending the next ones.
This webpage describes it clearly; I hope it is helpful for you.
The next packet can be pushed into the pipeline before the acknowledgment for the
previous packets is received. The number of outstanding packets is limited by the
client's outstanding-packet window size.
Do you have any performance requirements for ingesting data into HDFS?
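To make the effect of the outstanding-packet window concrete, here is a small back-of-the-envelope model in Python (a sketch with made-up numbers, not the actual client code): it compares a stop-and-wait client, which blocks on every ack, with a windowed client that keeps several packets in flight.

```python
# Toy latency model of HDFS packet streaming (illustrative only; the numbers
# and function names are made up, not real HDFS constants). Times are in
# arbitrary units.

def stop_and_wait(n_packets, xmit, rtt):
    """Client waits for each packet's ack before sending the next."""
    return n_packets * (xmit + rtt)

def windowed(n_packets, xmit, rtt, window):
    """Client keeps up to `window` unacknowledged packets in flight."""
    # Per-packet cost is bounded either by the link (xmit) or by how fast
    # the window turns over ((xmit + rtt) / window).
    per_packet = max(xmit, (xmit + rtt) / window)
    return n_packets * per_packet + rtt

# 1000 packets, 1 unit to transmit each, 10 units round trip for an ack:
print(stop_and_wait(1000, 1, 10))   # 11000 units
print(windowed(1000, 1, 10, 80))    # 1010 units: the window hides the acks
```

With a window of 1 the second function degenerates to roughly the stop-and-wait cost, which is why the window size matters.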

Date: Thu, 19 Jun 2014 11:51:43 +0530
Subject: Re: HDFS File Writes & Reads
From: vijay.bhoomireddy@gmail.com
To: user@hadoop.apache.org

@Zesheng Wu, thanks for the response.
I still don't understand how HDFS reduces the time to write and read a file, compared to a
traditional file read / write mechanism. 

For example, if I am writing a file using the default configurations, Hadoop internally has
to write each block to 3 data nodes. My understanding is that for each block, the client
first writes the block to the first data node in the pipeline, which then forwards it to the
second, and so on. Once the third data node successfully receives the block, it sends an
acknowledgement back to data node 2, and finally to the client through data node 1. Only
after receiving the acknowledgement for the block is the write considered successful, and
the client proceeds to write the next block.

If this is the case, then the time taken to write each block is three times that of a normal
write due to the replication factor, and the write process happens sequentially, block after
block.

Please correct me if I am wrong in my understanding. Also, two questions:
1. My understanding is that file read / write in Hadoop doesn't have any parallelism, and the
best it can perform is the same as a traditional file read or write, plus some overhead from
the distributed communication mechanism.
2. Parallelism is provided only during the data processing phase via MapReduce, not during
file read / write by a client.

On 17 June 2014 19:37, Zesheng Wu <wuzesheng86@gmail.com> wrote:

1. HDFS doesn't allow parallel writes.
2. HDFS uses a pipeline to write the multiple replicas, so it doesn't take three times longer than a traditional file write.
3. HDFS allows parallel reads.
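Point 2 is the crux: because each data node forwards packets to the next one as soon as they arrive, the three replicas are written almost concurrently rather than one after another. A rough model of the difference (illustrative numbers, not real HDFS timings):

```python
# Rough model of replication cost (illustrative only; function names and
# numbers are made up for the sketch).

def copy_then_copy(block_time, replicas):
    """Naive scheme: write the full block to each replica in turn."""
    return replicas * block_time

def pipelined(block_time, packet_time, replicas):
    """HDFS-style pipeline: each data node streams packets onward as they
    arrive, so extra replicas only add per-packet forwarding delay."""
    return block_time + (replicas - 1) * packet_time

# A block that takes 100 units to transmit, split into packets of 1 unit:
print(copy_then_copy(100, 3))    # 300 units
print(pipelined(100, 1, 3))      # 102 units: nowhere near 3x
```

The pipelined cost approaches the single-copy cost as the packet size shrinks relative to the block size, which is why replication does not triple the write time.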

2014-06-17 19:17 GMT+08:00 Vijaya Narayana Reddy Bhoomi Reddy <vijay.bhoomireddy@gmail.com>:


I have a basic question regarding file writes and reads in HDFS. Is the file write and read
process a sequential activity or executed in parallel?

For example, let's assume that there is a file File1 which consists of three blocks B1,
B2 and B3.

1. Will the write process write B2 only after B1 is complete, and B3 only after B2 is complete,
or, for a large file with many blocks, can this happen in parallel? In all the Hadoop
documentation, I read this to be a sequential operation. Does that mean that for a file of 1 TB,
it takes three times longer than a traditional file write? (due to the default replication factor of 3)

2. Is it similar in the case of read as well?

Could someone kindly provide some clarity on this?


Best Wishes!

Yours, Zesheng
