hadoop-hdfs-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bertrand Dechoux <decho...@gmail.com>
Subject Re: Replace a block with a new one
Date Mon, 21 Jul 2014 13:46:18 GMT
I wrote my answer thinking about the XOR implementation. With reed-solomon
and single replication, the cases that need to be considered are indeed
smaller, simpler.

It seems I was wrong about my last statement though. If the machine hosting
a single-replicated block is lost, it isn't likely that the block can't be
reconstructed from a summary of the data. But with a RAID6 strategy / RS,
its is indeed possible but of course up to a certain number of blocks.

There clearly is a case for such tool on a Hadoop cluster.

Bertrand Dechoux


On Mon, Jul 21, 2014 at 2:35 PM, Zesheng Wu <wuzesheng86@gmail.com> wrote:

>  If a block is corrupted but hasn't been detected by HDFS, you could
> delete the block from the local filesystem (it's only a file) then HDFS
> will replicate the good remaining replica of this block.
>
> We only have one replica for each block, if a block is corrupted, HDFS
> cannot replicate it.
>
>
> 2014-07-21 20:30 GMT+08:00 Zesheng Wu <wuzesheng86@gmail.com>:
>
> Thanks Bertrand, my reply comments inline following.
>>
>> So you know that a block is corrupted thanks to an external process which
>> in this case is checking the parity blocks. If a block is corrupted but
>> hasn't been detected by HDFS, you could delete the block from the local
>> filesystem (it's only a file) then HDFS will replicate the good remaining
>> replica of this block.
>>  *[Zesheng: We will implement a periodical checking mechanism to check
>> the corrupted blocks.]*
>>
>> For performance reason (and that's what you want to do?), you might be
>> able to fix the corruption without needing to retrieve the good replica. It
>> might be possible by working directly with the local system by replacing
>> the corrupted block by the corrected block (which again are files). On
>> issue is that the corrected block might be different than the good replica.
>> If HDFS is able to tell (with CRC) it might be good else you will end up
>> with two different good replicas for the same block and that will not be
>> pretty...
>> *[Zesheng: Indeed we will use the reed-solomon erasure codes to implement
>> the HDFS raid, so the corrupted block will be recovered from the back good
>> data and coding blocks]*
>>
>> If the result is to be open source, you might want to check with Facebook
>> about their implementation and track the process within Apache JIRA. You
>> could gain additional feedbacks. One downside of HDFS RAID is that the less
>> replicas there is, the less read of the data for processing will be
>> 'efficient/fast'. Reducing the number of replicas also diminishes the
>> number of supported node failures. I wouldn't say it's an easy ride.
>> *[Zesheng: Yes, I agree with you that the read performance downgrade, but
>> not with the number of supported node failures, reed-solomon algorithm can
>> maintain equal or even higher node failures **tolerance. About open
>> source, we will consider this in the future.**]*
>>
>>
>> 2014-07-21 20:01 GMT+08:00 Bertrand Dechoux <dechouxb@gmail.com>:
>>
>> So you know that a block is corrupted thanks to an external process which
>>> in this case is checking the parity blocks. If a block is corrupted but
>>> hasn't been detected by HDFS, you could delete the block from the local
>>> filesystem (it's only a file) then HDFS will replicate the good remaining
>>> replica of this block.
>>>
>>> For performance reason (and that's what you want to do?), you might be
>>> able to fix the corruption without needing to retrieve the good replica. It
>>> might be possible by working directly with the local system by replacing
>>> the corrupted block by the corrected block (which again are files). On
>>> issue is that the corrected block might be different than the good replica.
>>> If HDFS is able to tell (with CRC) it might be good else you will end up
>>> with two different good replicas for the same block and that will not be
>>> pretty...
>>>
>>> If the result is to be open source, you might want to check with
>>> Facebook about their implementation and track the process within Apache
>>> JIRA. You could gain additional feedbacks. One downside of HDFS RAID is
>>> that the less replicas there is, the less read of the data for processing
>>> will be 'efficient/fast'. Reducing the number of replicas also diminishes
>>> the number of supported node failures. I wouldn't say it's an easy ride.
>>>
>>> Bertrand Dechoux
>>>
>>>
>>> On Mon, Jul 21, 2014 at 1:29 PM, Zesheng Wu <wuzesheng86@gmail.com>
>>> wrote:
>>>
>>>> We want to implement a RAID on top of HDFS, something like facebook
>>>> implemented as described in:
>>>> https://code.facebook.com/posts/536638663113101/saving-capacity-with-hdfs-raid/
>>>>
>>>>
>>>> 2014-07-21 17:19 GMT+08:00 Bertrand Dechoux <dechouxb@gmail.com>:
>>>>
>>>> You want to implement a RAID on top of HDFS or use HDFS on top of RAID?
>>>>> I am not sure I understand any of these use cases. HDFS handles for you
>>>>> replication and error detection. Fine tuning the cluster wouldn't be
the
>>>>> easier solution?
>>>>>
>>>>> Bertrand Dechoux
>>>>>
>>>>>
>>>>> On Mon, Jul 21, 2014 at 7:25 AM, Zesheng Wu <wuzesheng86@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks for reply, Arpit.
>>>>>> Yes, we need to do this regularly. The original requirement of this
>>>>>> is that we want to do RAID(which is based reed-solomon erasure codes)
on
>>>>>> our HDFS cluster. When a block is corrupted or missing, the downgrade
read
>>>>>> needs quick recovery of the block. We are considering how to recovery
the
>>>>>> corrupted/missing block quickly.
>>>>>>
>>>>>>
>>>>>> 2014-07-19 5:18 GMT+08:00 Arpit Agarwal <aagarwal@hortonworks.com>:
>>>>>>
>>>>>>> IMHO this is a spectacularly bad idea. Is it a one off event?
Why
>>>>>>> not just take the perf hit and recreate the file?
>>>>>>>
>>>>>>> If you need to do this regularly you should consider a mutable
file
>>>>>>> store like HBase. If you start modifying blocks from under HDFS
you open up
>>>>>>> all sorts of consistency issues.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jul 18, 2014 at 2:10 PM, Shumin Guo <gsmsteve@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> That will break the consistency of the file system, but it
doesn't
>>>>>>>> hurt to try.
>>>>>>>>  On Jul 17, 2014 8:48 PM, "Zesheng Wu" <wuzesheng86@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> How about write a new block with new checksum file, and
replace
>>>>>>>>> the old block file and checksum file both?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2014-07-17 19:34 GMT+08:00 Wellington Chevreuil <
>>>>>>>>> wellington.chevreuil@gmail.com>:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> there's no way to do that, as HDFS does not provide
file updates
>>>>>>>>>> features. You'll need to write a new file with the
changes.
>>>>>>>>>>
>>>>>>>>>> Notice that even if you manage to find the physical
block replica
>>>>>>>>>> files on the disk, corresponding to the part of the
file you want to
>>>>>>>>>> change, you can't simply update it manually, as this
would give a different
>>>>>>>>>> checksum, making HDFS mark such blocks as corrupt.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Wellington.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 17 Jul 2014, at 10:50, Zesheng Wu <wuzesheng86@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> > Hi guys,
>>>>>>>>>> >
>>>>>>>>>> > I recently encounter a scenario which needs
to replace an exist
>>>>>>>>>> block with a newly written block
>>>>>>>>>> > The most straightforward way to finish may be
like this:
>>>>>>>>>> > Suppose the original file is A, and we write
a new file B which
>>>>>>>>>> is composed by the new data blocks, then we merge
A and B to C which is the
>>>>>>>>>> file we wanted
>>>>>>>>>> > The obvious shortcoming of this method is wasting
of network
>>>>>>>>>> bandwidth
>>>>>>>>>> >
>>>>>>>>>> > I'm wondering whether there is a way to replace
the old block
>>>>>>>>>> by the new block directly.
>>>>>>>>>> > Any thoughts?
>>>>>>>>>> >
>>>>>>>>>> > --
>>>>>>>>>> > Best Wishes!
>>>>>>>>>> >
>>>>>>>>>> > Yours, Zesheng
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Best Wishes!
>>>>>>>>>
>>>>>>>>> Yours, Zesheng
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> CONFIDENTIALITY NOTICE
>>>>>>> NOTICE: This message is intended for the use of the individual
or
>>>>>>> entity to which it is addressed and may contain information that
is
>>>>>>> confidential, privileged and exempt from disclosure under applicable
law.
>>>>>>> If the reader of this message is not the intended recipient,
you are hereby
>>>>>>> notified that any printing, copying, dissemination, distribution,
>>>>>>> disclosure or forwarding of this communication is strictly prohibited.
If
>>>>>>> you have received this communication in error, please contact
the sender
>>>>>>> immediately and delete it from your system. Thank You.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best Wishes!
>>>>>>
>>>>>> Yours, Zesheng
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Best Wishes!
>>>>
>>>> Yours, Zesheng
>>>>
>>>
>>>
>>
>>
>> --
>> Best Wishes!
>>
>> Yours, Zesheng
>>
>
>
>
> --
> Best Wishes!
>
> Yours, Zesheng
>

Mime
View raw message