Subject: Re: Data Locality and WebHDFS
From: Alejandro Abdelnur
Date: Mon, 17 Mar 2014 10:09:19 -0700
To: user@hadoop.apache.org

actually, i am wrong, the webhdfs rest call has an offset.

Alejandro
(phone typing)

> On Mar 17, 2014, at 10:07, Alejandro Abdelnur wrote:
>
> dont recall how skips are handled in webhdfs, but i would assume that you'll get to the first block as usual, and the skip is handled by the DN serving the file (as webhdfs does not know at open that you'll skip)
>
> Alejandro
> (phone typing)
>
>> On Mar 17, 2014, at 9:47, RJ Nowling wrote:
>>
>> Hi Alejandro,
>>
>> The WebHDFS API allows specifying an offset and length for the request.
>> If I specify an offset that starts in the second block of a file (thus skipping the first block altogether), will the namenode still direct me to a datanode with the first block, or will it direct me to a datanode with the second block? I.e., am I assured data locality only on the first block of the file (as you're saying) or on the first block I am accessing?
>>
>> If it is as you say, then I may want to reach out to the WebHDFS developers and see if they would be interested in the additional functionality.
>>
>> Thank you,
>> RJ
>>
>>
>>> On Mon, Mar 17, 2014 at 2:40 AM, Alejandro Abdelnur wrote:
>>> I may have expressed myself wrong. You don't need to do any test to see how locality works with files of multiple blocks. If you are accessing a file of more than one block over webhdfs, you only have assured locality for the first block of the file.
>>>
>>> Thanks.
>>>
>>>
>>>> On Sun, Mar 16, 2014 at 9:18 PM, RJ Nowling wrote:
>>>> Thank you, Mingjiang and Alejandro.
>>>>
>>>> This is interesting. Since we will use the data locality information for scheduling, we could "hack" this to get the data locality information, at least for the first block. As Alejandro says, we'd have to test what happens for other data blocks -- e.g., what if, knowing the block sizes, we request the second or third block?
>>>>
>>>> Interesting food for thought! I see some experiments in my future!
>>>>
>>>> Thanks!
>>>>
>>>>
>>>>> On Sun, Mar 16, 2014 at 10:14 PM, Alejandro Abdelnur wrote:
>>>>> well, this is for the first block of the file, the rest of the file (blocks being local or not) are streamed out by the same datanode.
>>>>> for small files (one block) you'll get locality; for large files only the first block, and by chance if other blocks are local to that datanode.
>>>>>
>>>>>
>>>>> Alejandro
>>>>> (phone typing)
>>>>>
>>>>>> On Mar 16, 2014, at 18:53, Mingjiang Shi wrote:
>>>>>>
>>>>>> According to this page: http://hortonworks.com/blog/webhdfs-%E2%80%93-http-rest-access-to-hdfs/
>>>>>>> Data Locality: The file read and file write calls are redirected to the corresponding datanodes. It uses the full bandwidth of the Hadoop cluster for streaming data.
>>>>>>>
>>>>>>> A HDFS Built-in Component: WebHDFS is a first class built-in component of HDFS. It runs inside Namenodes and Datanodes, therefore, it can use all HDFS functionalities. It is a part of HDFS – there are no additional servers to install.
>>>>>>>
>>>>>>
>>>>>> So it looks like the data locality is built into webhdfs; the client will be redirected to the data node automatically.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Mon, Mar 17, 2014 at 6:07 AM, RJ Nowling wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I'm writing up a Google Summer of Code proposal to add HDFS support to Disco, an Erlang MapReduce framework.
>>>>>>>
>>>>>>> We're interested in using WebHDFS. I have two questions:
>>>>>>>
>>>>>>> 1) Does WebHDFS allow querying data locality information?
>>>>>>>
>>>>>>> 2) If the data locality information is known, can data on specific data nodes be accessed via WebHDFS? Or do all WebHDFS requests have to go through a single server?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> RJ
>>>>>>>
>>>>>>> --
>>>>>>> em rnowling@gmail.com
>>>>>>> c 954.496.2314
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Cheers
>>>>>> -MJ
>>>>
>>>>
>>>>
>>>> --
>>>> em rnowling@gmail.com
>>>> c 954.496.2314
>>>
>>>
>>>
>>> --
>>> Alejandro
>>
>>
>>
>> --
>> em rnowling@gmail.com
>> c 954.496.2314
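[Editor's note: Alejandro's correction refers to the offset and length parameters of the WebHDFS OPEN operation. A minimal sketch of how a client might build such a request and recover the serving datanode from the namenode's HTTP 307 redirect; the hostnames, paths, and the classic default ports 50070 (namenode) / 50075 (datanode) are illustrative assumptions, not taken from this thread:]

```python
from urllib.parse import urlencode, urlparse

def webhdfs_open_url(namenode, port, path, offset=0, length=None):
    """Build a WebHDFS OPEN URL; `path` is an absolute HDFS path."""
    params = {"op": "OPEN", "offset": offset}
    if length is not None:
        params["length"] = length
    return "http://%s:%d/webhdfs/v1%s?%s" % (namenode, port, path, urlencode(params))

def datanode_from_redirect(location):
    """The namenode answers OPEN with a 307 redirect; the Location
    header names the datanode chosen to serve the read."""
    return urlparse(location).hostname

# Hypothetical hosts and file for illustration only:
url = webhdfs_open_url("nn.example.com", 50070, "/user/rj/part-00000",
                       offset=134217728, length=134217728)
```

Requesting the URL with redirects disabled (e.g. `curl -i` without `-L`) exposes the Location header without transferring any file data.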
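[Editor's note: the "hack" RJ proposes (probe which datanode the namenode redirects to for each block) could be sketched as below. This is speculative experiment code under the thread's assumptions: the client knows the file size and block size, and per Alejandro's caveat a redirect only assures locality for the block at the requested offset. Hosts and ports are hypothetical.]

```python
from urllib.parse import urlencode

def block_offsets(file_size, block_size):
    """Starting offset of each HDFS block of the file."""
    return list(range(0, file_size, block_size))

def probe_urls(namenode, port, path, file_size, block_size):
    # One OPEN URL per block; fetching each with redirects disabled and
    # recording the 307 Location header would reveal a datanode holding
    # the block at that offset -- the experiment proposed in the thread.
    return [
        "http://%s:%d/webhdfs/v1%s?%s"
        % (namenode, port, path, urlencode({"op": "OPEN", "offset": off, "length": 1}))
        for off in block_offsets(file_size, block_size)
    ]

# A hypothetical 300 MB file with a 128 MB block size spans three blocks:
urls = probe_urls("nn.example.com", 50070, "/user/rj/part-00000",
                  file_size=300 * 2**20, block_size=128 * 2**20)
```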