hadoop-user mailing list archives

From Zesheng Wu <wuzeshen...@gmail.com>
Subject Re: Replace a block with a new one
Date Tue, 22 Jul 2014 01:31:58 GMT
Thanks Bertrand, I've checked this information earlier. There's only an XOR
implementation, and missing blocks are reconstructed by creating new files.
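For readers unfamiliar with the XOR scheme mentioned above, here is a minimal sketch of stripe-level XOR parity: one parity block protects a stripe of data blocks, so a single missing block per stripe can be rebuilt. The byte strings are illustrative, not real HDFS block files.

```python
def xor_blocks(blocks):
    """XOR equal-length blocks byte by byte."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# A stripe of three data blocks plus one parity block.
stripe = [b"aaaa", b"bbbb", b"cccc"]
parity = xor_blocks(stripe)

# Lose one block; rebuild it from the parity and the survivors.
lost = stripe[1]
rebuilt = xor_blocks([stripe[0], stripe[2], parity])
assert rebuilt == lost
```

With XOR, losing two blocks of the same stripe is unrecoverable, which is why the thread below turns to Reed-Solomon.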



2014-07-22 3:47 GMT+08:00 Bertrand Dechoux <dechouxb@gmail.com>:

> And there is actually quite a lot of information about it.
>
>
> https://github.com/facebook/hadoop-20/blob/master/src/contrib/raid/src/java/org/apache/hadoop/hdfs/DistributedRaidFileSystem.java
>
> http://wiki.apache.org/hadoop/HDFS-RAID
>
>
> https://issues.apache.org/jira/browse/MAPREDUCE/component/12313416/?selectedTab=com.atlassian.jira.jira-projects-plugin:component-summary-panel
>
> Bertrand Dechoux
>
>
> On Mon, Jul 21, 2014 at 3:46 PM, Bertrand Dechoux <dechouxb@gmail.com>
> wrote:
>
>> I wrote my answer thinking about the XOR implementation. With
>> Reed-Solomon and single replication, the cases that need to be considered
>> are indeed fewer and simpler.
>>
>> It seems I was wrong about my last statement, though. If the machine
>> hosting a single-replicated block is lost, with plain XOR it is likely
>> that the block can't be reconstructed from a summary of the data. But with
>> a RAID6-style strategy / RS, it is indeed possible, though of course only
>> up to a certain number of lost blocks.
>>
>> There clearly is a case for such a tool on a Hadoop cluster.
>>
>> Bertrand Dechoux
>>
>>
>> On Mon, Jul 21, 2014 at 2:35 PM, Zesheng Wu <wuzesheng86@gmail.com>
>> wrote:
>>
>>>  If a block is corrupted but hasn't been detected by HDFS, you could
>>> delete the block from the local filesystem (it's only a file), then HDFS
>>> will replicate the good remaining replica of this block.
>>>
>>> We only have one replica for each block; if a block is corrupted, HDFS
>>> cannot re-replicate it.
>>>
>>>
>>> 2014-07-21 20:30 GMT+08:00 Zesheng Wu <wuzesheng86@gmail.com>:
>>>
>>> Thanks Bertrand; my reply comments are inline below.
>>>>
>>>> So you know that a block is corrupted thanks to an external process
>>>> which in this case is checking the parity blocks. If a block is corrupted
>>>> but hasn't been detected by HDFS, you could delete the block from the local
>>>> filesystem (it's only a file) then HDFS will replicate the good remaining
>>>> replica of this block.
>>>>  *[Zesheng: We will implement a periodic checking mechanism to detect
>>>> corrupted blocks.]*
>>>>
>>>> For performance reasons (and that's what you want to do?), you might be
>>>> able to fix the corruption without needing to retrieve the good replica. It
>>>> might be possible by working directly with the local filesystem, replacing
>>>> the corrupted block with the corrected block (which, again, are files). One
>>>> issue is that the corrected block might be different from the good replica.
>>>> If HDFS is able to tell (with CRC) it might be fine; else you will end up
>>>> with two different "good" replicas for the same block, and that will not be
>>>> pretty...
>>>> *[Zesheng: Indeed, we will use Reed-Solomon erasure codes to
>>>> implement HDFS-RAID, so a corrupted block will be recovered from the
>>>> remaining good data and coding blocks.]*
>>>>
>>>> If the result is to be open source, you might want to check with
>>>> Facebook about their implementation and track the process within Apache
>>>> JIRA. You could gain additional feedback. One downside of HDFS RAID is
>>>> that the fewer replicas there are, the less efficient/fast reads of the
>>>> data for processing will be. Reducing the number of replicas also diminishes
>>>> the number of tolerated node failures. I wouldn't say it's an easy ride.
>>>> *[Zesheng: Yes, I agree with you that read performance degrades, but not
>>>> about the number of tolerated node failures; the Reed-Solomon algorithm
>>>> can maintain equal or even higher tolerance of node failures. About
>>>> open source, we will consider this in the future.]*
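The tolerance claim above can be illustrated with back-of-the-envelope numbers; the (k, m) = (10, 4) parameters below are hypothetical, not taken from the thread:

```python
# Compare r-way replication with a hypothetical RS(k=10, m=4) layout.
# Replication with r copies tolerates r - 1 lost replicas of a block;
# RS(k, m) tolerates any m lost blocks per stripe of k + m blocks.

def replication(copies):
    return {"overhead": float(copies), "tolerated_failures": copies - 1}

def reed_solomon(k, m):
    return {"overhead": (k + m) / k, "tolerated_failures": m}

rep = replication(3)       # 3.0x storage, tolerates 2 failures
rs = reed_solomon(10, 4)   # 1.4x storage, tolerates 4 failures

assert rs["tolerated_failures"] >= rep["tolerated_failures"]
assert rs["overhead"] < rep["overhead"]
```

So an RS layout can indeed tolerate more simultaneous block losses than 3x replication while storing less data, at the cost of slower degraded reads.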
>>>>
>>>>
>>>> 2014-07-21 20:01 GMT+08:00 Bertrand Dechoux <dechouxb@gmail.com>:
>>>>
>>>>> So you know that a block is corrupted thanks to an external process
>>>>> which in this case is checking the parity blocks. If a block is corrupted
>>>>> but hasn't been detected by HDFS, you could delete the block from the local
>>>>> filesystem (it's only a file), then HDFS will replicate the good remaining
>>>>> replica of this block.
>>>>>
>>>>> For performance reasons (and that's what you want to do?), you might be
>>>>> able to fix the corruption without needing to retrieve the good replica. It
>>>>> might be possible by working directly with the local filesystem, replacing
>>>>> the corrupted block with the corrected block (which, again, are files). One
>>>>> issue is that the corrected block might be different from the good replica.
>>>>> If HDFS is able to tell (with CRC) it might be fine; else you will end up
>>>>> with two different "good" replicas for the same block, and that will not be
>>>>> pretty...
>>>>>
>>>>> If the result is to be open source, you might want to check with
>>>>> Facebook about their implementation and track the process within Apache
>>>>> JIRA. You could gain additional feedback. One downside of HDFS RAID is
>>>>> that the fewer replicas there are, the less efficient/fast reads of the
>>>>> data for processing will be. Reducing the number of replicas also diminishes
>>>>> the number of tolerated node failures. I wouldn't say it's an easy ride.
>>>>>
>>>>> Bertrand Dechoux
>>>>>
>>>>>
>>>>> On Mon, Jul 21, 2014 at 1:29 PM, Zesheng Wu <wuzesheng86@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> We want to implement RAID on top of HDFS, something like what Facebook
>>>>>> implemented, as described in:
>>>>>> https://code.facebook.com/posts/536638663113101/saving-capacity-with-hdfs-raid/
>>>>>>
>>>>>>
>>>>>> 2014-07-21 17:19 GMT+08:00 Bertrand Dechoux <dechouxb@gmail.com>:
>>>>>>
>>>>>>> You want to implement a RAID on top of HDFS, or use HDFS on top of
>>>>>>> RAID? I am not sure I understand either of these use cases. HDFS handles
>>>>>>> replication and error detection for you. Wouldn't fine-tuning the
>>>>>>> cluster be the easier solution?
>>>>>>>
>>>>>>> Bertrand Dechoux
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jul 21, 2014 at 7:25 AM, Zesheng Wu <wuzesheng86@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks for the reply, Arpit.
>>>>>>>> Yes, we need to do this regularly. The original requirement for this
>>>>>>>> is that we want to do RAID (based on Reed-Solomon erasure codes) on
>>>>>>>> our HDFS cluster. When a block is corrupted or missing, a degraded read
>>>>>>>> needs quick recovery of the block. We are considering how to recover the
>>>>>>>> corrupted/missing block quickly.
>>>>>>>>
>>>>>>>>
>>>>>>>> 2014-07-19 5:18 GMT+08:00 Arpit Agarwal <aagarwal@hortonworks.com>:
>>>>>>>>
>>>>>>>>> IMHO this is a spectacularly bad idea. Is it a one-off event? Why
>>>>>>>>> not just take the perf hit and recreate the file?
>>>>>>>>>
>>>>>>>>> If you need to do this regularly you should consider a mutable
>>>>>>>>> file store like HBase. If you start modifying blocks from under HDFS
>>>>>>>>> you open up all sorts of consistency issues.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Jul 18, 2014 at 2:10 PM, Shumin Guo <gsmsteve@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> That will break the consistency of the file system, but it
>>>>>>>>>> doesn't hurt to try.
>>>>>>>>>>  On Jul 17, 2014 8:48 PM, "Zesheng Wu" <wuzesheng86@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> How about writing a new block with a new checksum file, and
>>>>>>>>>>> replacing both the old block file and the checksum file?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2014-07-17 19:34 GMT+08:00 Wellington Chevreuil <
>>>>>>>>>>> wellington.chevreuil@gmail.com>:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> there's no way to do that, as HDFS does not provide file
>>>>>>>>>>>> update features. You'll need to write a new file with the changes.
>>>>>>>>>>>>
>>>>>>>>>>>> Notice that even if you manage to find the physical block
>>>>>>>>>>>> replica files on disk corresponding to the part of the file you
>>>>>>>>>>>> want to change, you can't simply update them manually, as this
>>>>>>>>>>>> would give a different checksum, making HDFS mark such blocks as
>>>>>>>>>>>> corrupt.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Wellington.
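Wellington's checksum point can be sketched outside HDFS. DataNodes record per-chunk CRC32/CRC32C checksums in a .meta file next to each block file; this simplified sketch checksums a whole block at once with Python's zlib CRC32 rather than HDFS's exact chunked format:

```python
import zlib

# Why manually editing a block file marks it corrupt: the checksum
# recorded in the block's .meta file no longer matches the data.
block = b"original block contents"
stored_checksum = zlib.crc32(block)   # what the .meta file records

# An out-of-band edit to the block file on the DataNode's disk:
edited = b"tampered block contents!"
assert zlib.crc32(edited) != stored_checksum  # HDFS would flag this replica
```

This is why the thread converges on replacing the block file and its checksum file together, rather than patching block bytes in place.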
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 17 Jul 2014, at 10:50, Zesheng Wu <wuzesheng86@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> > Hi guys,
>>>>>>>>>>>> >
>>>>>>>>>>>> > I recently encountered a scenario which needs to replace an
>>>>>>>>>>>> > existing block with a newly written block.
>>>>>>>>>>>> > The most straightforward way to do this may be as follows:
>>>>>>>>>>>> > suppose the original file is A; we write a new file B composed
>>>>>>>>>>>> > of the new data blocks, then we merge A and B into C, which is
>>>>>>>>>>>> > the file we want.
>>>>>>>>>>>> > The obvious shortcoming of this method is the waste of network
>>>>>>>>>>>> > bandwidth.
>>>>>>>>>>>> >
>>>>>>>>>>>> > I'm wondering whether there is a way to replace the old block
>>>>>>>>>>>> > with the new block directly.
>>>>>>>>>>>> > Any thoughts?
>>>>>>>>>>>> >
>>>>>>>>>>>> > --
>>>>>>>>>>>> > Best Wishes!
>>>>>>>>>>>> >
>>>>>>>>>>>> > Yours, Zesheng
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Best Wishes!
>>>>>>>>>>>
>>>>>>>>>>> Yours, Zesheng
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best Wishes!
>>>>>>>>
>>>>>>>> Yours, Zesheng
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best Wishes!
>>>>>>
>>>>>> Yours, Zesheng
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Best Wishes!
>>>>
>>>> Yours, Zesheng
>>>>
>>>
>>>
>>>
>>> --
>>> Best Wishes!
>>>
>>> Yours, Zesheng
>>>
>>
>>
>


-- 
Best Wishes!

Yours, Zesheng
