Subject: Re: Data Locality and WebHDFS
From: RJ Nowling <rnowling@gmail.com>
To: user@hadoop.apache.org, Tsz Wo Sze
Date: Mon, 17 Mar 2014 17:52:21 -0400

Thank you, Tsz. That helps!

On Mon, Mar 17, 2014 at 2:30 PM, Tsz Wo Sze wrote:
> The file offset is considered in WebHDFS redirection. It redirects to a
> datanode with the first block the client is going to read, not the first
> block of the file.
>
> Hope it helps.
> Tsz-Wo
>
> On Monday, March 17, 2014 10:09 AM, Alejandro Abdelnur <tucu@cloudera.com> wrote:
>
> Actually, I am wrong; the WebHDFS REST call has an offset.
>
> Alejandro
> (phone typing)
>
> On Mar 17, 2014, at 10:07, Alejandro Abdelnur wrote:
>
> I don't recall how skips are handled in WebHDFS, but I would assume that
> you'll get to the first block as usual, and the skip is handled by the DN
> serving the file (as WebHDFS does not know at open time that you'll skip).
>
> On Mar 17, 2014, at 9:47, RJ Nowling wrote:
>
> Hi Alejandro,
>
> The WebHDFS API allows specifying an offset and length for the request.
> If I specify an offset that starts in the second block of a file (thus
> skipping the first block altogether), will the namenode still direct me
> to a datanode with the first block, or will it direct me to a datanode
> with the second block? I.e., am I assured data locality only on the first
> block of the file (as you're saying), or on the first block I am accessing?
>
> If it is as you say, then I may want to reach out to the WebHDFS
> developers and see if they would be interested in the additional
> functionality.
>
> Thank you,
> RJ
>
> On Mon, Mar 17, 2014 at 2:40 AM, Alejandro Abdelnur wrote:
>
> I may have expressed myself wrong. You don't need to do any test to see
> how locality works with files of multiple blocks. If you are accessing a
> file of more than one block over WebHDFS, you only have assured locality
> for the first block of the file.
>
> Thanks.
>
> On Sun, Mar 16, 2014 at 9:18 PM, RJ Nowling wrote:
>
> Thank you, Mingjiang and Alejandro.
>
> This is interesting. Since we will use the data locality information for
> scheduling, we could "hack" this to get the data locality information, at
> least for the first block. As Alejandro says, we'd have to test what
> happens for other data blocks -- e.g., what if, knowing the block sizes,
> we request the second or third block?
>
> Interesting food for thought! I see some experiments in my future!
>
> Thanks!
>
> On Sun, Mar 16, 2014 at 10:14 PM, Alejandro Abdelnur wrote:
>
> Well, this is for the first block of the file; the rest of the file
> (blocks being local or not) is streamed out by the same datanode. For
> small files (one block) you'll get locality; for large files, only the
> first block, and by chance if other blocks are local to that datanode.
>
> Alejandro
> (phone typing)
>
> On Mar 16, 2014, at 18:53, Mingjiang Shi wrote:
>
> According to this page:
> http://hortonworks.com/blog/webhdfs-%E2%80%93-http-rest-access-to-hdfs/
>
> *Data Locality*: The file read and file write calls are redirected to
> the corresponding datanodes. It uses the full bandwidth of the Hadoop
> cluster for streaming data.
>
> *A HDFS Built-in Component*: WebHDFS is a first-class built-in component
> of HDFS. It runs inside namenodes and datanodes and can therefore use
> all HDFS functionality. It is a part of HDFS -- there are no additional
> servers to install.
>
> So it looks like data locality is built into WebHDFS; the client will be
> redirected to the data node automatically.
>
> On Mon, Mar 17, 2014 at 6:07 AM, RJ Nowling wrote:
>
> Hi all,
>
> I'm writing up a Google Summer of Code proposal to add HDFS support to
> Disco, an Erlang MapReduce framework.
>
> We're interested in using WebHDFS. I have two questions:
>
> 1) Does WebHDFS allow querying data locality information?
>
> 2) If the data locality information is known, can data on specific data
> nodes be accessed via WebHDFS? Or do all WebHDFS requests have to go
> through a single server?
>
> Thanks,
> RJ
>
> --
> Cheers
> -MJ
>
> --
> Alejandro

--
em rnowling@gmail.com
c 954.496.2314
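A minimal sketch of the per-block read pattern discussed in the thread above: given the file length and block size, a scheduler could issue one WebHDFS OPEN request per block, each starting at that block's offset, so the namenode's offset-aware redirect (as Tsz-Wo describes) points at a datanode holding that block. The hostname, port, and path below are placeholders, and this assumes a cluster where anonymous reads are permitted (a real client would also pass user.name or a delegation token).

```python
def block_offsets(file_length, block_size):
    """Return (start_offset, length) for each HDFS block of a file."""
    offsets = []
    start = 0
    while start < file_length:
        offsets.append((start, min(block_size, file_length - start)))
        start += block_size
    return offsets

def open_url(namenode, path, offset, length):
    """Build a WebHDFS OPEN URL for a byte range of a file."""
    return ("http://{nn}/webhdfs/v1{path}?op=OPEN"
            "&offset={off}&length={length}").format(
                nn=namenode, path=path, off=offset, length=length)

# Example: a 300 MB file with a 128 MB block size spans three blocks.
MB = 1024 * 1024
blocks = block_offsets(300 * MB, 128 * MB)
urls = [open_url("namenode.example.com:50070", "/data/file.txt", off, ln)
        for off, ln in blocks]
```

Fetching each URL with redirects enabled (e.g. `curl -L`) and inspecting the 307 Location header would reveal which datanode serves each block, which is one way to probe the locality behavior the experiments above are after.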