Subject: Re: Data Locality and WebHDFS
From: RJ Nowling <rnowling@gmail.com>
To: user@hadoop.apache.org, Tsz Wo Sze
Date: Mon, 17 Mar 2014 17:52:21 -0400

Thank you, Tsz. That helps!

On Mon, Mar 17, 2014 at 2:30 PM, Tsz Wo Sze wrote:
> The file offset is considered in WebHDFS redirection. It redirects to a
> datanode with the first block the client is going to read, not the first
> block of the file.
>
> Hope it helps.
> Tsz-Wo
>
> On Monday, March 17, 2014 10:09 AM, Alejandro Abdelnur <tucu@cloudera.com> wrote:
>
> Actually, I am wrong; the WebHDFS REST call has an offset.
>
> Alejandro
> (phone typing)
>
> On Mar 17, 2014, at 10:07, Alejandro Abdelnur wrote:
>
> I don't recall how skips are handled in WebHDFS, but I would assume that
> you'll get to the first block as usual, and the skip is handled by the DN
> serving the file (as WebHDFS does not know at open time that you'll skip).
>
> On Mar 17, 2014, at 9:47, RJ Nowling wrote:
>
> Hi Alejandro,
>
> The WebHDFS API allows specifying an offset and length for the request.
> If I specify an offset that starts in the second block of a file (thus
> skipping the first block altogether), will the namenode still direct me
> to a datanode with the first block, or will it direct me to a datanode
> with the second block? I.e., am I assured data locality only on the first
> block of the file (as you're saying), or on the first block I am accessing?
>
> If it is as you say, then I may want to reach out to the WebHDFS
> developers and see if they would be interested in the additional
> functionality.
>
> Thank you,
> RJ
>
> On Mon, Mar 17, 2014 at 2:40 AM, Alejandro Abdelnur wrote:
>
> I may have expressed myself wrong. You don't need to do any test to see
> how locality works with files of multiple blocks. If you are accessing a
> file of more than one block over WebHDFS, you only have assured locality
> for the first block of the file.
>
> Thanks.
>
> On Sun, Mar 16, 2014 at 9:18 PM, RJ Nowling wrote:
>
> Thank you, Mingjiang and Alejandro.
>
> This is interesting. Since we will use the data locality information for
> scheduling, we could "hack" this to get the data locality information, at
> least for the first block. As Alejandro says, we'd have to test what
> happens for other data blocks -- e.g., what if, knowing the block sizes,
> we request the second or third block?
>
> Interesting food for thought! I see some experiments in my future!
>
> Thanks!
>
> On Sun, Mar 16, 2014 at 10:14 PM, Alejandro Abdelnur wrote:
>
> Well, this is for the first block of the file; the rest of the file
> (blocks being local or not) is streamed out by the same datanode. For
> small files (one block) you'll get locality; for large files, only the
> first block, and by chance if other blocks are local to that datanode.
>
> Alejandro
> (phone typing)
>
> On Mar 16, 2014, at 18:53, Mingjiang Shi wrote:
>
> According to this page:
> http://hortonworks.com/blog/webhdfs-%E2%80%93-http-rest-access-to-hdfs/
>
> *Data Locality*: The file read and file write calls are redirected to
> the corresponding datanodes. It uses the full bandwidth of the Hadoop
> cluster for streaming data.
>
> *A HDFS Built-in Component*: WebHDFS is a first-class built-in component
> of HDFS. It runs inside namenodes and datanodes and can therefore use
> all HDFS functionality. It is a part of HDFS -- there are no additional
> servers to install.
>
> So it looks like data locality is built into WebHDFS; the client will be
> redirected to the data node automatically.
>
> On Mon, Mar 17, 2014 at 6:07 AM, RJ Nowling wrote:
>
> Hi all,
>
> I'm writing up a Google Summer of Code proposal to add HDFS support to
> Disco, an Erlang MapReduce framework.
>
> We're interested in using WebHDFS. I have two questions:
>
> 1) Does WebHDFS allow querying data locality information?
>
> 2) If the data locality information is known, can data on specific data
> nodes be accessed via WebHDFS? Or do all WebHDFS requests have to go
> through a single server?
>
> Thanks,
> RJ
>
> --
> Cheers
> -MJ
>
> --
> Alejandro

--
em rnowling@gmail.com
c 954.496.2314
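A minimal sketch of the per-block read pattern discussed in the thread above: given the file length and block size, a scheduler could issue one WebHDFS OPEN request per block, each starting at that block's offset, so the namenode's offset-aware redirect (as Tsz-Wo describes) points at a datanode holding that block. The hostname, port, and path below are placeholders, and this assumes a cluster where anonymous reads are permitted (a real client would also pass user.name or a delegation token).

```python
def block_offsets(file_length, block_size):
    """Return (start_offset, length) for each HDFS block of a file."""
    offsets = []
    start = 0
    while start < file_length:
        offsets.append((start, min(block_size, file_length - start)))
        start += block_size
    return offsets

def open_url(namenode, path, offset, length):
    """Build a WebHDFS OPEN URL for a byte range of a file."""
    return ("http://{nn}/webhdfs/v1{path}?op=OPEN"
            "&offset={off}&length={length}").format(
                nn=namenode, path=path, off=offset, length=length)

# Example: a 300 MB file with a 128 MB block size spans three blocks.
MB = 1024 * 1024
blocks = block_offsets(300 * MB, 128 * MB)
urls = [open_url("namenode.example.com:50070", "/data/file.txt", off, ln)
        for off, ln in blocks]
```

Fetching each URL with redirects enabled (e.g. `curl -L`) and inspecting the 307 Location header would reveal which datanode serves each block, which is one way to probe the locality behavior the experiments above are after.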