hadoop-user mailing list archives

From Vijaya Narayana Reddy Bhoomi Reddy <vijay.bhoomire...@gmail.com>
Subject Re: HDFS File Writes & Reads
Date Thu, 19 Jun 2014 06:21:43 GMT
@Zesheng Wu, thanks for the response.

I still don't understand how HDFS reduces the time to write and read a
file, compared to a traditional file read / write mechanism.

For example, if I am writing a file, using the default configurations,
Hadoop internally has to write each block to 3 data nodes. My understanding
is that for each block, the client first writes the block to the first data
node in the pipeline, which then passes it on to the second, and so on. Once
the third data node successfully receives the block, it sends an
acknowledgement back to data node 2, which finally reaches the client through
data node 1. *Only after receiving the acknowledgement for the block is the
write considered successful, and the client proceeds to write the next block.*

If this is the case, then the time taken to write each block is 3 times
that of a normal write because of the replication factor, and the write
process happens sequentially, block after block.
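
To make the comparison concrete, here is a rough timing model of the two possibilities: the store-and-forward behaviour described above, and the pipelined behaviour from point 2 of the quoted reply, where each data node forwards small packets downstream as soon as it receives them. This is only a sketch. The block size, packet size, and bandwidth are assumed numbers, the function names are hypothetical, and acknowledgements and disk time are ignored.

```python
# Toy timing model; every number here is an illustrative assumption,
# not a measurement of a real HDFS cluster.

def store_and_forward_time(block_bytes, bandwidth, replicas):
    """Each node receives the whole block before sending it to the next."""
    return replicas * (block_bytes / bandwidth)

def pipelined_time(block_bytes, packet_bytes, bandwidth, replicas):
    """Each node forwards a packet downstream as soon as it has received it."""
    packets = -(-block_bytes // packet_bytes)  # ceiling division
    per_packet = packet_bytes / bandwidth
    # The client needs `packets` time slots to send everything; the last
    # packet then needs (replicas - 1) more hops to reach the final node.
    return (packets + replicas - 1) * per_packet

BLOCK = 64 * 1024 * 1024        # assumed block size: 64 MB
PACKET = 64 * 1024              # assumed packet size: 64 KB
BANDWIDTH = 100 * 1024 * 1024   # assumed link speed: 100 MB/s on every hop

sequential = store_and_forward_time(BLOCK, BANDWIDTH, 3)
pipelined = pipelined_time(BLOCK, PACKET, BANDWIDTH, 3)
print(f"store-and-forward: {sequential:.3f} s")
print(f"pipelined:         {pipelined:.3f} s")
```

With these assumed numbers the store-and-forward model gives about 1.92 s per block and the pipelined model about 0.64 s, which is close to the time of a single unreplicated transfer.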

Please correct me if my understanding is wrong. I also have the following
questions:

1. My understanding is that file read / write in Hadoop doesn't have any
parallelism, and the best it can perform is the same as a traditional file
read or write plus some overhead from the distributed communication.
2. Parallelism is provided only during the data processing phase via
MapReduce, but not during file read / write by a client.
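
To illustrate what a parallel read by a client could look like (point 3 of the quoted reply), here is a minimal sketch in which each block is fetched by its own worker and the results are reassembled in order. It uses a local temporary file as a stand-in and does not call any HDFS API. `BLOCK_SIZE`, `read_block`, and `parallel_read` are hypothetical names chosen for the example, and the block size is deliberately tiny.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4  # tiny hypothetical block size so the example is easy to follow

def read_block(path, index):
    """Read one block through its own file handle, independent of the others."""
    with open(path, "rb") as f:
        f.seek(index * BLOCK_SIZE)
        return f.read(BLOCK_SIZE)

def parallel_read(path):
    """Fetch all blocks concurrently and join them in their original order."""
    size = os.path.getsize(path)
    block_count = -(-size // BLOCK_SIZE)  # ceiling division
    with ThreadPoolExecutor(max_workers=3) as pool:
        # pool.map yields results in input order, whatever order they finish in.
        blocks = pool.map(lambda i: read_block(path, i), range(block_count))
        return b"".join(blocks)

# Stand-in for File1 with three blocks B1, B2 and B3.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"B1__B2__B3__")
    file1 = tmp.name

data = parallel_read(file1)
os.remove(file1)
print(data)
```

In a real cluster the blocks would sit on different data nodes, so concurrent requests could use several disks and network links at once; on a single local disk this sketch shows only the access pattern, not a speed-up.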


On 17 June 2014 19:37, Zesheng Wu <wuzesheng86@gmail.com> wrote:

> 1. HDFS doesn't allow parallel writes
> 2. HDFS uses a pipeline to write multiple replicas, so it doesn't take three
> times more time than a traditional file write
> 3. HDFS allows parallel reads
> 2014-06-17 19:17 GMT+08:00 Vijaya Narayana Reddy Bhoomi Reddy <
> vijay.bhoomireddy@gmail.com>:
>> Hi,
>> I have a basic question regarding file writes and reads in HDFS. Is the
>> file write and read process a sequential activity or executed in parallel?
>> For example, let's assume that there is a file File1 which consists of
>> three blocks B1, B2 and B3.
>> 1. Will the write process write B2 only after B1 is complete and B3 only
>> after B2 is complete or for a large file with many blocks, can this happen
>> in parallel? In all the hadoop documentation, I read this to be a
>> sequential operation. Does that mean for a file of 1TB, it takes three
>> times more time than a traditional file write? (due to default replication
>> factor of 3)
>> 2. Is it similar in the case of read as well?
>> Could someone please provide some clarity on this?
>> Regards
>> Vijay
> --
> Best Wishes!
> Yours, Zesheng
