Subject: Re: Data Locality and WebHDFS
From: Alejandro Abdelnur <tucu@cloudera.com>
Date: Mon, 17 Mar 2014 10:07:31 -0700
To: user@hadoop.apache.org

I don't recall how skips are handled in WebHDFS, but I would assume that you'll be directed to the first block as usual, and the skip is handled by the DN serving the file (as WebHDFS does not know at open time that you'll skip).

Alejandro
(phone typing)

> On Mar 17, 2014, at 9:47, RJ Nowling <rnowling@gmail.com> wrote:
>
> Hi Alejandro,
>
> The WebHDFS API allows specifying an offset and length for the request. If I specify an offset that starts in the second block of a file (thus skipping the first block altogether), will the namenode still direct me to a datanode with the first block, or will it direct me to a datanode with the second block? I.e., am I assured data locality only on the first block of the file (as you're saying), or on the first block I am accessing?
>
> If it is as you say, then I may want to reach out to the WebHDFS developers and see if they would be interested in the additional functionality.
>
> Thank you,
> RJ
>
>> On Mon, Mar 17, 2014 at 2:40 AM, Alejandro Abdelnur <tucu@cloudera.com> wrote:
>> I may have expressed myself wrong. You don't need to do any test to see how locality works with files of multiple blocks. If you are accessing a file of more than one block over WebHDFS, you only have assured locality for the first block of the file.
>>
>> Thanks.
>>
>>> On Sun, Mar 16, 2014 at 9:18 PM, RJ Nowling <rnowling@gmail.com> wrote:
>>> Thank you, Mingjiang and Alejandro.
>>>
>>> This is interesting. Since we will use the data locality information for scheduling, we could "hack" this to get the data locality information, at least for the first block. As Alejandro says, we'd have to test what happens for other data blocks -- e.g., what if, knowing the block sizes, we request the second or third block?
>>>
>>> Interesting food for thought! I see some experiments in my future!
>>>
>>> Thanks!
>>>
>>>> On Sun, Mar 16, 2014 at 10:14 PM, Alejandro Abdelnur <tucu@cloudera.com> wrote:
>>>> Well, this is for the first block of the file; the rest of the file (blocks being local or not) is streamed out by the same datanode. For small files (one block) you'll get locality; for large files, only the first block, and by chance if other blocks are local to that datanode.
>>>>
>>>> Alejandro
>>>> (phone typing)
>>>>
>>>>> On Mar 16, 2014, at 18:53, Mingjiang Shi <mshi@gopivotal.com> wrote:
>>>>>
>>>>> According to this page: http://hortonworks.com/blog/webhdfs-%E2%80%93-http-rest-access-to-hdfs/
>>>>>
>>>>>> Data Locality: The file read and file write calls are redirected to the corresponding datanodes. It uses the full bandwidth of the Hadoop cluster for streaming data.
>>>>>>
>>>>>> A HDFS Built-in Component: WebHDFS is a first-class built-in component of HDFS. It runs inside Namenodes and Datanodes, therefore it can use all HDFS functionalities. It is a part of HDFS -- there are no additional servers to install.
>>>>>
>>>>> So it looks like data locality is built into WebHDFS; the client will be redirected to the data node automatically.
>>>>>
>>>>>> On Mon, Mar 17, 2014 at 6:07 AM, RJ Nowling <rnowling@gmail.com> wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I'm writing up a Google Summer of Code proposal to add HDFS support to Disco, an Erlang MapReduce framework.
>>>>>>
>>>>>> We're interested in using WebHDFS. I have two questions:
>>>>>>
>>>>>> 1) Does WebHDFS allow querying data locality information?
>>>>>>
>>>>>> 2) If the data locality information is known, can data on specific data nodes be accessed via WebHDFS? Or do all WebHDFS requests have to go through a single server?
>>>>>>
>>>>>> Thanks,
>>>>>> RJ
>>>>>>
>>>>>> --
>>>>>> em rnowling@gmail.com
>>>>>> c 954.496.2314
>>>>>
>>>>> --
>>>>> Cheers
>>>>> -MJ
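[Editor's note: the "hack" discussed in this thread -- learning which datanode will serve a given byte range -- can be sketched by issuing a WebHDFS OPEN request with an `offset` parameter and *not* following the 307 redirect: the redirect's Location header names the datanode chosen to serve the stream. The namenode address and file path below are placeholder assumptions, and this sketch has not been run against a live cluster.]

```python
# Sketch: inspect the WebHDFS OPEN redirect to see which datanode
# would serve a read starting at a given offset. The namenode
# host:port and path are placeholders.
from urllib.parse import urlencode, urlparse

def webhdfs_open_url(namenode, path, offset=0, length=None):
    """Build the namenode-side WebHDFS OPEN URL for a byte range."""
    params = {"op": "OPEN", "offset": offset}
    if length is not None:
        params["length"] = length
    return "http://%s/webhdfs/v1%s?%s" % (namenode, path, urlencode(params))

def datanode_from_location(location):
    """Extract host:port of the datanode from a 307 redirect Location."""
    return urlparse(location).netloc

# Against a real cluster you would stop at the redirect, e.g.:
#
#   import urllib.request, urllib.error
#   class NoRedirect(urllib.request.HTTPRedirectHandler):
#       def redirect_request(self, *args, **kwargs):
#           return None  # don't follow; we only want the Location header
#   opener = urllib.request.build_opener(NoRedirect)
#   try:
#       opener.open(webhdfs_open_url("namenode:50070", "/data/file.bin",
#                                    offset=256 * 1024 * 1024))
#   except urllib.error.HTTPError as e:
#       print(datanode_from_location(e.headers["Location"]))

print(webhdfs_open_url("namenode:50070", "/data/file.bin", offset=134217728))
# -> http://namenode:50070/webhdfs/v1/data/file.bin?op=OPEN&offset=134217728
```

Whether the chosen datanode actually holds the block containing `offset` (rather than just the file's first block) is exactly what the thread leaves open to experiment.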
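[Editor's note: RJ's question -- "knowing the block sizes, we request the second or third block" -- reduces to simple arithmetic against the file's HDFS block size (128 MB by default in Hadoop 2.x). The helper below is an illustrative sketch, not part of any WebHDFS API.]

```python
# Illustrative helper: given an offset/length request and the file's HDFS
# block size, compute which block indices the request touches. Whether
# WebHDFS gives you locality for any block beyond the one containing
# `offset` is the open question in the thread above.
DEFAULT_BLOCK_SIZE = 128 * 1024 * 1024  # Hadoop 2.x default dfs.blocksize

def blocks_for_range(offset, length, block_size=DEFAULT_BLOCK_SIZE):
    """Return the list of block indices covered by [offset, offset+length)."""
    if length <= 0:
        return []
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return list(range(first, last + 1))

# A 100 MB read starting 200 MB into a file touches blocks 1 and 2
# (with 128 MB blocks), skipping block 0 entirely.
print(blocks_for_range(200 * 1024 * 1024, 100 * 1024 * 1024))  # -> [1, 2]
```

So a scheduler could pick the `offset` for each task to start exactly on a block boundary (`index * block_size`) and then test, per the thread, whether the redirect actually lands on a datanode holding that block.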