From: Vijaya Narayana Reddy Bhoomi Reddy <vijay.bhoomireddy@gmail.com>
To: user@hadoop.apache.org
Date: Fri, 20 Jun 2014 11:24:41 +0530
Subject: Re: HDFS File Writes & Reads

Yong,

Thanks for the clarification. It was more of an academic query. We do not have any performance requirements at this stage.

Regards
Vijay


On 19 June 2014 19:05, java8964 <java8964@hotmail.com> wrote:

Your understanding is almost correct, except for the part you highlighted.

HDFS is not designed for write performance, but the client does not have to wait for the acknowledgment of previous packets before sending the next ones.

This webpage describes it clearly, and I hope it is helpful for you:

http://aosabook.org/en/hdfs.html

Quoting from it: "The next packet can be pushed to the pipeline before receiving the acknowledgment for the previous packets. The number of outstanding packets is limited by the outstanding packets window size of the client."
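As a concrete illustration, here is a minimal client-side write sketch (the path, buffer size, and loop count are made up, not taken from the thread). All of the packet chopping, pipelining, and the outstanding-packet window live inside the stream that create() returns; user code just writes bytes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteSketch {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // create() sets up block allocation and the DataNode pipeline;
            // replication defaults to 3 unless overridden in the config.
            try (FSDataOutputStream out = fs.create(new Path("/tmp/example.dat"))) {
                byte[] buf = new byte[64 * 1024];
                for (int i = 0; i < 1024; i++) {
                    // Each write() is chopped into packets by the client;
                    // packets stream down the pipeline without waiting for
                    // the previous packet's ack, up to the window size.
                    out.write(buf);
                }
            } // close() flushes the remaining packets and waits for the final acks.
        }
    }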
Do you have any performance requirements for ingesting data into HDFS?

Yong

------------------------------
Date: Thu, 19 Jun 2014 11:51:43 +0530
Subject: Re: HDFS File Writes & Reads
From: vijay.bhoomireddy@gmail.com
To: user@hadoop.apache.org

@Zesheng Wu, thanks for the response.

I still don't understand how HDFS reduces the time to write and read a file compared to a traditional file read/write mechanism.

For example, if I am writing a file using the default configuration, Hadoop internally has to write each block to 3 DataNodes. My understanding is that for each block, the client first writes the block to the first DataNode in the pipeline, which then informs the second, and so on. Once the third DataNode successfully receives the block, it sends an acknowledgement back to DataNode 2 and finally to the client through DataNode 1. *Only after receiving the acknowledgement for the block is the write considered successful, and the client proceeds to write the next block.*

If this is the case, then the time taken to write each block is three times that of a normal write, due to the replication factor, and the write process happens sequentially, block after block.

Please correct me if I am wrong in my understanding. Also, two further questions:

1. My understanding is that file reads/writes in Hadoop don't have any parallelism, and the best they can perform is the same as a traditional file read or write, plus some overhead from the distributed communication mechanism.
2. Parallelism is provided only during the data processing phase via MapReduce, not during file reads/writes by a client.

Regards
Vijay


On 17 June 2014 19:37, Zesheng Wu <wuzesheng86@gmail.com> wrote:

1. HDFS doesn't allow parallel writes to the same file.
2. HDFS uses a pipeline to write the replicas, so it doesn't take three times longer than a traditional file write (a rough illustration follows below).
3. HDFS does allow parallel reads (see the read sketch below).
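To see why point 2 holds, a rough back-of-envelope illustration (the numbers are made up): suppose pushing one block to a single DataNode takes time T. Because each DataNode forwards every packet downstream as soon as it receives it, the second and third replicas are written almost concurrently with the first, so one block costs roughly T plus a couple of packet-forwarding delays, not 3T. A file of N blocks therefore takes on the order of N*T plus a small pipeline-fill overhead, rather than 3*N*T.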
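For point 3, here is a minimal sketch of parallel reads against one file (the file name, thread count, and buffer size are made up). The positioned read(position, buffer, offset, length) call does not move the stream's file pointer, so separate threads can read different block ranges at the same time, each from whichever DataNode holds a replica of that range:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsParallelReadSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/tmp/example.dat");
            long blockSize = fs.getFileStatus(file).getBlockSize();
            try (FSDataInputStream in = fs.open(file)) {
                Thread[] readers = new Thread[3];
                for (int i = 0; i < readers.length; i++) {
                    final long offset = i * blockSize; // one reader per block
                    readers[i] = new Thread(() -> {
                        byte[] buf = new byte[8 * 1024];
                        try {
                            // Positioned read: independent of the other
                            // threads' positions, served by a DataNode that
                            // stores the block containing this offset.
                            in.read(offset, buf, 0, buf.length);
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    });
                    readers[i].start();
                }
                for (Thread t : readers) {
                    t.join();
                }
            }
        }
    }

This is also what MapReduce builds on: each map task opens the file and reads only its own split, so the parallelism during processing that you mention in your question 2 rests on parallel HDFS reads.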

2014-06-17 19:17 GMT+08:00 Vijaya Narayana Reddy Bhoomi Reddy <vijay.bhoomireddy@gmail.com>:

Hi,

I have a basic question regarding file writes and reads in HDFS. Are the file write and read processes sequential activities, or executed in parallel?

For example, let's assume that there is a file File1 which consists of three blocks B1, B2 and B3.

1. Will the write process write B2 only after B1 is complete, and B3 only after B2 is complete, or, for a large file with many blocks, can this happen in parallel? In all the Hadoop documentation, I read this to be a sequential operation. Does that mean that for a file of 1 TB, it takes three times longer than a traditional file write (due to the default replication factor of 3)?
2. Is it similar in the case of reads as well?

Could someone please provide some clarity on this?

Regards
Vijay


--
Best Wishes!

Yours, Zesheng