From: Alejandro Abdelnur <tucu@cloudera.com>
Date: Sun, 16 Mar 2014 23:40:08 -0700
Subject: Re: Data Locality and WebHDFS
To: "common-user@hadoop.apache.org"

I may have expressed myself poorly. You don't need to run any test to see
how locality works with files of multiple blocks: if you are accessing a
file of more than one block over WebHDFS, you only have assured locality
for the first block of the file.

Thanks.
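P.S. An untested sketch, in case anyone wants to poke at it: this walks a
file block by block and asks the NameNode where each OPEN would be
redirected, without following the redirect. Python with the requests
library; the namenode host, user, and file path are placeholders, and it
assumes a non-secure cluster with the default 50070 NameNode HTTP port.

import requests

NAMENODE = "http://namenode.example.com:50070"  # placeholder host, default port
USER = "hdfs"                                   # placeholder pseudo-auth user

def block_locations(path):
    """Yield (offset, redirect_url) for each block boundary of `path`."""
    # File length and block size come from the documented GETFILESTATUS op.
    status = requests.get(
        NAMENODE + "/webhdfs/v1" + path,
        params={"op": "GETFILESTATUS", "user.name": USER},
    ).json()["FileStatus"]

    for offset in range(0, status["length"], status["blockSize"]):
        # OPEN at the block boundary, but don't follow the 307 redirect;
        # the Location header names the datanode the NameNode picked for
        # the block containing this offset.
        resp = requests.get(
            NAMENODE + "/webhdfs/v1" + path,
            params={"op": "OPEN", "offset": offset, "user.name": USER},
            allow_redirects=False,
        )
        yield offset, resp.headers["Location"]

for offset, url in block_locations("/user/hdfs/big-file.bin"):
    print(offset, url)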
On Sun, Mar 16, 2014 at 9:18 PM, RJ Nowling <rnowling@gmail.com> wrote:

> Thank you, Mingjiang and Alejandro.
>
> This is interesting. Since we will use the data locality information for
> scheduling, we could "hack" this to get the data locality information, at
> least for the first block. As Alejandro says, we'd have to test what
> happens for other data blocks -- e.g., what if, knowing the block sizes,
> we request the second or third block?
>
> Interesting food for thought! I see some experiments in my future!
>
> Thanks!
>
>
> On Sun, Mar 16, 2014 at 10:14 PM, Alejandro Abdelnur <tucu@cloudera.com> wrote:
>
>> Well, this is for the first block of the file; the rest of the file
>> (whether its blocks are local or not) is streamed out by the same
>> datanode. For small files (one block) you'll get locality; for large
>> files, only for the first block -- and, by chance, for any other blocks
>> that happen to be local to that datanode.
>>
>> Alejandro
>> (phone typing)
>>
>> On Mar 16, 2014, at 18:53, Mingjiang Shi wrote:
>>
>> According to this page:
>> http://hortonworks.com/blog/webhdfs-%E2%80%93-http-rest-access-to-hdfs/
>>
>>> *Data Locality*: The file read and file write calls are redirected to
>>> the corresponding datanodes. It uses the full bandwidth of the Hadoop
>>> cluster for streaming data.
>>>
>>> *A HDFS Built-in Component*: WebHDFS is a first class built-in
>>> component of HDFS. It runs inside Namenodes and Datanodes, therefore,
>>> it can use all HDFS functionalities. It is a part of HDFS -- there are
>>> no additional servers to install.
>>
>> So it looks like data locality is built into WebHDFS: the client will be
>> redirected to the data node automatically.
>>
>>
>> On Mon, Mar 17, 2014 at 6:07 AM, RJ Nowling <rnowling@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I'm writing up a Google Summer of Code proposal to add HDFS support to
>>> Disco, an Erlang MapReduce framework.
>>>
>>> We're interested in using WebHDFS. I have two questions:
>>>
>>> 1) Does WebHDFS allow querying data locality information?
>>>
>>> 2) If the data locality information is known, can data on specific data
>>> nodes be accessed via WebHDFS? Or do all WebHDFS requests have to go
>>> through a single server?
>>>
>>> Thanks,
>>> RJ
>>>
>>> --
>>> em rnowling@gmail.com
>>> c 954.496.2314
>>
>> --
>> Cheers
>> -MJ
>
> --
> em rnowling@gmail.com
> c 954.496.2314

--
Alejandro
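To RJ's second question in the thread above: the metadata lookup always
goes through the NameNode, but the bytes are served by whichever datanode
the redirect names, and a client that has already captured that URL can
fetch from it directly. Another untested sketch under the same assumptions,
reusing the placeholder block_locations helper from the P.S. above; the
128 MB block size is the Hadoop 2.x default, not something to hard-code in
real use.

import requests

def read_range(redirect_url, nbytes):
    # The redirect URL already carries op=OPEN and the chosen offset;
    # adding `length` caps the read so we pull at most one block's worth.
    resp = requests.get(redirect_url, params={"length": nbytes}, stream=True)
    resp.raise_for_status()
    return resp.raw  # file-like stream served by the datanode itself

blocks = list(block_locations("/user/hdfs/big-file.bin"))
offset, url = blocks[1]                            # second block; assumes >= 2 blocks
data = read_range(url, 128 * 1024 * 1024).read()   # assumed 128 MB block size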