hadoop-common-user mailing list archives

From Zesheng Wu <wuzeshen...@gmail.com>
Subject Re: Replace a block with a new one
Date Mon, 21 Jul 2014 12:35:27 GMT
> If a block is corrupted but hasn't been detected by HDFS, you could delete
> the block from the local filesystem (it's only a file) and HDFS will
> re-replicate the block from the remaining good replicas.

We only have one replica for each block, so if a block is corrupted, HDFS
cannot re-replicate it.
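
For the multi-replica case described above, a minimal sketch of what deleting
a bad replica file on a DataNode's local disk might look like. The data
directory and block id below are hypothetical examples (the real path depends
on dfs.datanode.data.dir and the block pool layout), and on a live DataNode
you would normally stop the daemon first or wait for the directory scanner
and block reports to propagate the deletion:

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    // Sketch: remove a corrupted replica (block file plus its .meta file)
    // from a DataNode's local disk so the NameNode re-replicates the block
    // from a healthy replica elsewhere.
    public class DeleteCorruptReplica {
        public static void main(String[] args) throws IOException {
            Path dataDir = Paths.get("/data/1/dfs/dn"); // example dfs.datanode.data.dir
            String block = "blk_1073741825";            // example block to remove

            try (Stream<Path> files = Files.walk(dataDir)) {
                files.filter(p -> p.getFileName().toString().equals(block)
                               || p.getFileName().toString().startsWith(block + "_"))
                     .forEach(p -> {
                         System.out.println("deleting replica file: " + p);
                         try {
                             Files.delete(p); // removes blk_N and blk_N_GS.meta
                         } catch (IOException e) {
                             throw new UncheckedIOException(e);
                         }
                     });
            }
        }
    }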


2014-07-21 20:30 GMT+08:00 Zesheng Wu <wuzesheng86@gmail.com>:

> Thanks Bertrand, my replies are inline below.
>
> So you know that a block is corrupted thanks to an external process which
> in this case is checking the parity blocks. If a block is corrupted but
> hasn't been detected by HDFS, you could delete the block from the local
> filesystem (it's only a file) and HDFS will re-replicate the block from the
> remaining good replicas.
> *[Zesheng: We will implement a periodic checking mechanism to detect
> corrupted blocks.]*
>
> For performance reasons (and that's what you want to do?), you might be
> able to fix the corruption without needing to retrieve the good replica. It
> might be possible to work directly with the local filesystem, replacing the
> corrupted block with the corrected block (which, again, are files). One
> issue is that the corrected block might differ from the good replica. If
> HDFS is able to tell (with CRC) it might be fine; otherwise you will end up
> with two different "good" replicas for the same block, and that will not be
> pretty...
> *[Zesheng: Indeed, we will use Reed-Solomon erasure codes to implement
> HDFS RAID, so a corrupted block will be recovered from the remaining good
> data and coding blocks.]*
>
> If the result is to be open source, you might want to check with Facebook
> about their implementation and track the progress within Apache JIRA. You
> could gain additional feedback. One downside of HDFS RAID is that the fewer
> replicas there are, the less efficient/fast reads of the data for
> processing will be. Reducing the number of replicas also diminishes the
> number of node failures that can be tolerated. I wouldn't say it's an easy
> ride.
> *[Zesheng: Yes, I agree with you about the read performance degradation,
> but not about the number of tolerated node failures; the Reed-Solomon
> algorithm can maintain equal or even higher tolerance of node failures.
> About open source, we will consider this in the future.]*
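
As a concrete illustration of parity-based recovery, a self-contained sketch
of the single-parity (XOR) case, the simplest codec in HDFS RAID;
Reed-Solomon generalizes the same idea to m parity blocks tolerating any m
lost blocks in a stripe:

    import java.util.Arrays;

    // Sketch: single-parity (XOR) erasure coding. The parity block is
    // d0 ^ d1 ^ ... ^ d(k-1), so any ONE lost data block can be rebuilt
    // from the parity block plus the surviving data blocks.
    public class XorParityDemo {
        static byte[] xor(byte[] a, byte[] b) {
            byte[] out = new byte[a.length];
            for (int i = 0; i < a.length; i++) out[i] = (byte) (a[i] ^ b[i]);
            return out;
        }

        public static void main(String[] args) {
            byte[][] data = {               // three equal-sized "blocks"
                "block-0-data".getBytes(),
                "block-1-data".getBytes(),
                "block-2-data".getBytes(),
            };
            // Encode: parity is the XOR of all data blocks.
            byte[] parity = data[0];
            for (int i = 1; i < data.length; i++) parity = xor(parity, data[i]);

            // Simulate losing block 1, then rebuild it from parity + survivors.
            byte[] rebuilt = parity;
            for (int i = 0; i < data.length; i++)
                if (i != 1) rebuilt = xor(rebuilt, data[i]);

            System.out.println("recovered: " + new String(rebuilt));
            System.out.println("matches:   " + Arrays.equals(rebuilt, data[1]));
        }
    }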
>
>
> 2014-07-21 20:01 GMT+08:00 Bertrand Dechoux <dechouxb@gmail.com>:
>
>> So you know that a block is corrupted thanks to an external process which
>> in this case is checking the parity blocks. If a block is corrupted but
>> hasn't been detected by HDFS, you could delete the block from the local
>> filesystem (it's only a file) and HDFS will re-replicate the block from
>> the remaining good replicas.
>>
>> For performance reasons (and that's what you want to do?), you might be
>> able to fix the corruption without needing to retrieve the good replica.
>> It might be possible to work directly with the local filesystem, replacing
>> the corrupted block with the corrected block (which, again, are files).
>> One issue is that the corrected block might differ from the good replica.
>> If HDFS is able to tell (with CRC) it might be fine; otherwise you will
>> end up with two different "good" replicas for the same block, and that
>> will not be pretty...
>>
>> If the result is to be open source, you might want to check with Facebook
>> about their implementation and track the progress within Apache JIRA. You
>> could gain additional feedback. One downside of HDFS RAID is that the
>> fewer replicas there are, the less efficient/fast reads of the data for
>> processing will be. Reducing the number of replicas also diminishes the
>> number of node failures that can be tolerated. I wouldn't say it's an
>> easy ride.
>>
>> Bertrand Dechoux
>>
>>
>> On Mon, Jul 21, 2014 at 1:29 PM, Zesheng Wu <wuzesheng86@gmail.com>
>> wrote:
>>
>>> We want to implement RAID on top of HDFS, something like what Facebook
>>> implemented, as described in:
>>> https://code.facebook.com/posts/536638663113101/saving-capacity-with-hdfs-raid/
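
For a rough sense of the capacity math behind such a scheme (the figures
here are illustrative, not taken from the post): with plain 3x replication,
10 data blocks occupy 30 blocks of raw storage (3.0x overhead) and survive
the loss of any 2 replicas of a block, while a Reed-Solomon (10, 4) stripe
stores 10 data blocks plus 4 parity blocks in 14 (1.4x overhead) and
survives any 4 lost blocks in the stripe.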
>>>
>>>
>>> 2014-07-21 17:19 GMT+08:00 Bertrand Dechoux <dechouxb@gmail.com>:
>>>
>>>> You want to implement RAID on top of HDFS, or use HDFS on top of RAID?
>>>> I am not sure I understand either of these use cases. HDFS handles
>>>> replication and error detection for you. Wouldn't fine-tuning the
>>>> cluster be the easier solution?
>>>>
>>>> Bertrand Dechoux
>>>>
>>>>
>>>> On Mon, Jul 21, 2014 at 7:25 AM, Zesheng Wu <wuzesheng86@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks for the reply, Arpit.
>>>>> Yes, we need to do this regularly. The original requirement is that we
>>>>> want to do RAID (based on Reed-Solomon erasure codes) on our HDFS
>>>>> cluster. When a block is corrupted or missing, a degraded read needs
>>>>> quick recovery of the block. We are considering how to recover the
>>>>> corrupted/missing block quickly.
>>>>>
>>>>>
>>>>> 2014-07-19 5:18 GMT+08:00 Arpit Agarwal <aagarwal@hortonworks.com>:
>>>>>
>>>>>> IMHO this is a spectacularly bad idea. Is it a one-off event? Why not
>>>>>> just take the perf hit and recreate the file?
>>>>>>
>>>>>> If you need to do this regularly you should consider a mutable file
>>>>>> store like HBase. If you start modifying blocks from under HDFS you
>>>>>> open up all sorts of consistency issues.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Jul 18, 2014 at 2:10 PM, Shumin Guo <gsmsteve@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> That will break the consistency of the file system, but it doesn't
>>>>>>> hurt to try.
>>>>>>>  On Jul 17, 2014 8:48 PM, "Zesheng Wu" <wuzesheng86@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> How about writing a new block with a new checksum file, and
>>>>>>>> replacing both the old block file and the checksum file?
>>>>>>>>
>>>>>>>>
>>>>>>>> 2014-07-17 19:34 GMT+08:00 Wellington Chevreuil <
>>>>>>>> wellington.chevreuil@gmail.com>:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> There's no way to do that, as HDFS does not provide file update
>>>>>>>>> features. You'll need to write a new file with the changes.
>>>>>>>>>
>>>>>>>>> Notice that even if you manage to find the physical block replica
>>>>>>>>> files on disk corresponding to the part of the file you want to
>>>>>>>>> change, you can't simply update them manually, as this would give
>>>>>>>>> a different checksum, making HDFS mark such blocks as corrupt.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Wellington.
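
To make the checksum point concrete: HDFS keeps per-block checksums in a
sidecar .meta file, one checksum per fixed-size chunk (dfs.bytes-per-checksum,
512 bytes by default; recent versions default to CRC32C). A minimal sketch of
recomputing chunk checksums over a replica file, which shows why an in-place
byte edit no longer matches the stored values (plain CRC32 is used here for
simplicity, and the .meta file format itself is not parsed):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.zip.CRC32;

    // Sketch: compute one CRC per 512-byte chunk of a block file, the way
    // HDFS conceptually validates replicas against the blk_N_GS.meta file.
    public class ChunkChecksums {
        public static void main(String[] args) throws IOException {
            byte[] block = Files.readAllBytes(Paths.get(args[0])); // e.g. a blk_N file
            int bytesPerChecksum = 512;

            for (int off = 0; off < block.length; off += bytesPerChecksum) {
                int len = Math.min(bytesPerChecksum, block.length - off);
                CRC32 crc = new CRC32();
                crc.update(block, off, len);
                System.out.printf("chunk @%d (%d bytes): crc=%08x%n",
                                  off, len, crc.getValue());
            }
        }
    }

Flip a single byte in the block file and the CRC of that chunk changes,
which is exactly the mismatch that gets the replica reported as corrupt.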
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 17 Jul 2014, at 10:50, Zesheng Wu <wuzesheng86@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> > Hi guys,
>>>>>>>>> >
>>>>>>>>> > I recently encountered a scenario which requires replacing an
>>>>>>>>> existing block with a newly written block.
>>>>>>>>> > The most straightforward way to do this might be: suppose the
>>>>>>>>> original file is A, and we write a new file B composed of the new
>>>>>>>>> data blocks; then we merge A and B into C, which is the file we
>>>>>>>>> want.
>>>>>>>>> > The obvious shortcoming of this method is that it wastes network
>>>>>>>>> bandwidth.
>>>>>>>>> >
>>>>>>>>> > I'm wondering whether there is a way to replace the old block
>>>>>>>>> with the new block directly.
>>>>>>>>> > Any thoughts?
>>>>>>>>> >
>>>>>>>>> > --
>>>>>>>>> > Best Wishes!
>>>>>>>>> >
>>>>>>>>> > Yours, Zesheng
>>>>>>>>>
>>>>>>>>>
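
On the "merge A and B into C" idea: HDFS does expose a concat() call on
FileSystem (implemented by DistributedFileSystem) that splices source files
onto a target purely in NameNode metadata, with no block data copied over
the network. It comes with restrictions (same block size, and in at least
some versions all sources in the same directory as the target with every
non-final block full), so treat this as a hedged sketch with hypothetical
paths rather than a drop-in solution:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: merge file B onto file A without moving data. After concat(),
    // A contains A's blocks followed by B's blocks and B is removed; rename
    // A afterwards if you want the result to be called C.
    public class ConcatDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration()); // must be HDFS
            Path target = new Path("/user/demo/A");              // hypothetical
            Path[] sources = { new Path("/user/demo/B") };       // hypothetical

            fs.concat(target, sources); // metadata-only merge on the NameNode
            System.out.println("merged length: "
                               + fs.getFileStatus(target).getLen());
        }
    }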
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best Wishes!
>>>>>>>>
>>>>>>>> Yours, Zesheng
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best Wishes!
>>>>>
>>>>> Yours, Zesheng
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Best Wishes!
>>>
>>> Yours, Zesheng
>>>
>>
>>
>
>
> --
> Best Wishes!
>
> Yours, Zesheng
>



-- 
Best Wishes!

Yours, Zesheng
