Subject: Re: Replace a block with a new one
From: Bertrand Dechoux
To: user@hadoop.apache.org
Date: Mon, 21 Jul 2014 14:01:03 +0200

So you know that a block is corrupted thanks to an external process, which in this case is checking the parity blocks. If a block is corrupted but hasn't been detected by HDFS, you could delete the block from the local filesystem (it's only a file) and HDFS will then re-replicate it from the remaining good replica.

For performance reasons (and is that what you want to do?), you might be able to fix the corruption without retrieving the good replica. It might be possible by working directly with the local filesystem, replacing the corrupted block with the corrected block (which, again, are just files). One issue is that the corrected block might differ from the good replica. If HDFS is able to tell (with the CRC), it might be fine; otherwise you will end up with two different "good" replicas for the same block, and that will not be pretty...

If the result is to be open source, you might want to check with Facebook about their implementation and track the process within the Apache JIRA. You could gain additional feedback. One downside of HDFS RAID is that the fewer replicas there are, the less efficient/fast reads of the data for processing will be. Reducing the number of replicas also reduces the number of node failures that can be tolerated. I wouldn't say it's an easy ride.
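The "HDFS is able to tell (with CRC)" point can be illustrated: alongside each block file, the DataNode keeps a .meta file holding one checksum per fixed-size chunk of block data (512 bytes by default, `dfs.bytes-per-checksum`). The sketch below is a rough, hypothetical illustration of comparing two candidate replica files chunk by chunk — the function names are made up for this example, and a real check would have to parse the .meta header and use the DataNode's configured checksum type rather than plain CRC32:

```python
import zlib

BYTES_PER_CHECKSUM = 512  # HDFS default chunk size (dfs.bytes-per-checksum)

def chunk_checksums(path, chunk_size=BYTES_PER_CHECKSUM):
    """Compute one CRC32 per fixed-size chunk of the file, mimicking how
    HDFS checksums block data (one checksum per chunk in the .meta file)."""
    sums = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            sums.append(zlib.crc32(chunk))
    return sums

def replicas_match(path_a, path_b):
    """True when two candidate block files carry identical per-chunk
    checksums, i.e. they would be indistinguishable to a CRC comparison."""
    return chunk_checksums(path_a) == chunk_checksums(path_b)
```

If a hand-corrected block file fails such a comparison against the surviving replica, you are in exactly the "two different good replicas" situation described above.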
Bertrand Dechoux

On Mon, Jul 21, 2014 at 1:29 PM, Zesheng Wu wrote:

> We want to implement a RAID on top of HDFS, something like what Facebook
> implemented as described in:
> https://code.facebook.com/posts/536638663113101/saving-capacity-with-hdfs-raid/
>
>
> 2014-07-21 17:19 GMT+08:00 Bertrand Dechoux:
>
>> You want to implement a RAID on top of HDFS, or use HDFS on top of RAID?
>> I am not sure I understand either of these use cases. HDFS handles
>> replication and error detection for you. Wouldn't fine-tuning the
>> cluster be the easier solution?
>>
>> Bertrand Dechoux
>>
>>
>> On Mon, Jul 21, 2014 at 7:25 AM, Zesheng Wu wrote:
>>
>>> Thanks for the reply, Arpit.
>>> Yes, we need to do this regularly. The original requirement is that we
>>> want to do RAID (based on Reed-Solomon erasure codes) on our HDFS
>>> cluster. When a block is corrupted or missing, the degraded read needs
>>> quick recovery of the block. We are considering how to recover the
>>> corrupted/missing block quickly.
>>>
>>>
>>> 2014-07-19 5:18 GMT+08:00 Arpit Agarwal:
>>>
>>>> IMHO this is a spectacularly bad idea. Is it a one-off event? Why not
>>>> just take the perf hit and recreate the file?
>>>>
>>>> If you need to do this regularly you should consider a mutable file
>>>> store like HBase. If you start modifying blocks from under HDFS you
>>>> open up all sorts of consistency issues.
>>>>
>>>>
>>>> On Fri, Jul 18, 2014 at 2:10 PM, Shumin Guo wrote:
>>>>
>>>>> That will break the consistency of the file system, but it doesn't
>>>>> hurt to try.
>>>>>
>>>>> On Jul 17, 2014 8:48 PM, "Zesheng Wu" wrote:
>>>>>
>>>>>> How about writing a new block with a new checksum file, and
>>>>>> replacing both the old block file and the checksum file?
>>>>>>
>>>>>>
>>>>>> 2014-07-17 19:34 GMT+08:00 Wellington Chevreuil
>>>>>> <wellington.chevreuil@gmail.com>:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> there's no way to do that, as HDFS does not provide file update
>>>>>>> features. You'll need to write a new file with the changes.
>>>>>>>
>>>>>>> Notice that even if you manage to find the physical block replica
>>>>>>> files on the disk, corresponding to the part of the file you want
>>>>>>> to change, you can't simply update them manually, as this would
>>>>>>> give a different checksum, making HDFS mark such blocks as corrupt.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Wellington.
>>>>>>>
>>>>>>>
>>>>>>> On 17 Jul 2014, at 10:50, Zesheng Wu wrote:
>>>>>>>
>>>>>>> > Hi guys,
>>>>>>> >
>>>>>>> > I recently encountered a scenario which requires replacing an
>>>>>>> > existing block with a newly written block.
>>>>>>> > The most straightforward way to do this may be: suppose the
>>>>>>> > original file is A, and we write a new file B which is composed
>>>>>>> > of the new data blocks, then we merge A and B into C, which is
>>>>>>> > the file we want.
>>>>>>> > The obvious shortcoming of this method is the waste of network
>>>>>>> > bandwidth.
>>>>>>> >
>>>>>>> > I'm wondering whether there is a way to replace the old block
>>>>>>> > with the new block directly.
>>>>>>> > Any thoughts?
>>>>>>> >
>>>>>>> > --
>>>>>>> > Best Wishes!
>>>>>>> >
>>>>>>> > Yours, Zesheng
>>>>>>
>>>>>> --
>>>>>> Best Wishes!
>>>>>>
>>>>>> Yours, Zesheng
>>>
>>> --
>>> Best Wishes!
>>>
>>> Yours, Zesheng
>
> --
> Best Wishes!
>
> Yours, Zesheng
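The parity-block recovery this thread keeps circling around can be shown with the simplest erasure code: a single XOR parity block (the RAID-5 idea). Facebook's HDFS-RAID uses Reed-Solomon codes, which generalize this to tolerate multiple lost blocks, but the XOR case captures the principle. A toy sketch, assuming equal-length in-memory blocks rather than real HDFS block files:

```python
def xor_parity(blocks):
    """Compute a single parity block as the byte-wise XOR of
    equal-length data blocks."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

def recover(surviving_blocks, parity):
    """Rebuild one lost data block: XORing all survivors with the
    parity cancels every surviving block, leaving the missing one."""
    return xor_parity(surviving_blocks + [parity])

data = [b"aaaa", b"bbbb", b"cccc"]
parity = xor_parity(data)
# Lose data[1]; rebuild it from the two survivors plus the parity:
rebuilt = recover([data[0], data[2]], parity)  # == b"bbbb"
```

This is why a corrupted or missing block can be regenerated from parity without copying a full replica over the network, and also why the regenerated bytes must match the surviving replica's checksums exactly before being dropped back into the DataNode's storage.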