hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Zesheng Wu <wuzeshen...@gmail.com>
Subject Re: Replace a block with a new one
Date Tue, 22 Jul 2014 01:54:07 GMT
Mmm, it seems that the facebook branch
https://github.com/facebook/hadoop-20/
<https://github.com/facebook/hadoop-20/blob/master/src/contrib/raid/src/java/org/apache/hadoop/hdfs/DistributedRaidFileSystem.java>
has
implemented reed-solomon codes, what I was checking earlier were the
following two issues:
https://issues.apache.org/jira/browse/HDFS-503
https://issues.apache.org/jira/browse/HDFS-600

Thanks again Bertrand, I will check through the facebook branch to find
more information.


2014-07-22 9:31 GMT+08:00 Zesheng Wu <wuzesheng86@gmail.com>:

> Thank Bertrand, I've checked these information earlier. There's only XOR
> implementation, and missed blocks are reconstructed by creating new files.
>
>
>
> 2014-07-22 3:47 GMT+08:00 Bertrand Dechoux <dechouxb@gmail.com>:
>
> And there is actually quite a lot of information about it.
>>
>>
>> https://github.com/facebook/hadoop-20/blob/master/src/contrib/raid/src/java/org/apache/hadoop/hdfs/DistributedRaidFileSystem.java
>>
>> http://wiki.apache.org/hadoop/HDFS-RAID
>>
>>
>> https://issues.apache.org/jira/browse/MAPREDUCE/component/12313416/?selectedTab=com.atlassian.jira.jira-projects-plugin:component-summary-panel
>>
>> Bertrand Dechoux
>>
>>
>> On Mon, Jul 21, 2014 at 3:46 PM, Bertrand Dechoux <dechouxb@gmail.com>
>> wrote:
>>
>>> I wrote my answer thinking about the XOR implementation. With
>>> reed-solomon and single replication, the cases that need to be considered
>>> are indeed smaller, simpler.
>>>
>>> It seems I was wrong about my last statement though. If the machine
>>> hosting a single-replicated block is lost, it isn't likely that the block
>>> can't be reconstructed from a summary of the data. But with a RAID6
>>> strategy / RS, its is indeed possible but of course up to a certain number
>>> of blocks.
>>>
>>> There clearly is a case for such tool on a Hadoop cluster.
>>>
>>> Bertrand Dechoux
>>>
>>>
>>> On Mon, Jul 21, 2014 at 2:35 PM, Zesheng Wu <wuzesheng86@gmail.com>
>>> wrote:
>>>
>>>>  If a block is corrupted but hasn't been detected by HDFS, you could
>>>> delete the block from the local filesystem (it's only a file) then HDFS
>>>> will replicate the good remaining replica of this block.
>>>>
>>>> We only have one replica for each block, if a block is corrupted, HDFS
>>>> cannot replicate it.
>>>>
>>>>
>>>> 2014-07-21 20:30 GMT+08:00 Zesheng Wu <wuzesheng86@gmail.com>:
>>>>
>>>> Thanks Bertrand, my reply comments inline following.
>>>>>
>>>>> So you know that a block is corrupted thanks to an external process
>>>>> which in this case is checking the parity blocks. If a block is corrupted
>>>>> but hasn't been detected by HDFS, you could delete the block from the
local
>>>>> filesystem (it's only a file) then HDFS will replicate the good remaining
>>>>> replica of this block.
>>>>>  *[Zesheng: We will implement a periodical checking mechanism to
>>>>> check the corrupted blocks.]*
>>>>>
>>>>> For performance reason (and that's what you want to do?), you might be
>>>>> able to fix the corruption without needing to retrieve the good replica.
It
>>>>> might be possible by working directly with the local system by replacing
>>>>> the corrupted block by the corrected block (which again are files). On
>>>>> issue is that the corrected block might be different than the good replica.
>>>>> If HDFS is able to tell (with CRC) it might be good else you will end
up
>>>>> with two different good replicas for the same block and that will not
be
>>>>> pretty...
>>>>> *[Zesheng: Indeed we will use the reed-solomon erasure codes to
>>>>> implement the HDFS raid, so the corrupted block will be recovered from
the
>>>>> back good data and coding blocks]*
>>>>>
>>>>> If the result is to be open source, you might want to check with
>>>>> Facebook about their implementation and track the process within Apache
>>>>> JIRA. You could gain additional feedbacks. One downside of HDFS RAID
is
>>>>> that the less replicas there is, the less read of the data for processing
>>>>> will be 'efficient/fast'. Reducing the number of replicas also diminishes
>>>>> the number of supported node failures. I wouldn't say it's an easy ride.
>>>>> *[Zesheng: Yes, I agree with you that the read performance downgrade,
>>>>> but not with the number of supported node failures, reed-solomon algorithm
>>>>> can maintain equal or even higher node failures **tolerance. About
>>>>> open source, we will consider this in the future.**]*
>>>>>
>>>>>
>>>>> 2014-07-21 20:01 GMT+08:00 Bertrand Dechoux <dechouxb@gmail.com>:
>>>>>
>>>>> So you know that a block is corrupted thanks to an external process
>>>>>> which in this case is checking the parity blocks. If a block is corrupted
>>>>>> but hasn't been detected by HDFS, you could delete the block from
the local
>>>>>> filesystem (it's only a file) then HDFS will replicate the good remaining
>>>>>> replica of this block.
>>>>>>
>>>>>> For performance reason (and that's what you want to do?), you might
>>>>>> be able to fix the corruption without needing to retrieve the good
replica.
>>>>>> It might be possible by working directly with the local system by
replacing
>>>>>> the corrupted block by the corrected block (which again are files).
On
>>>>>> issue is that the corrected block might be different than the good
replica.
>>>>>> If HDFS is able to tell (with CRC) it might be good else you will
end up
>>>>>> with two different good replicas for the same block and that will
not be
>>>>>> pretty...
>>>>>>
>>>>>> If the result is to be open source, you might want to check with
>>>>>> Facebook about their implementation and track the process within
Apache
>>>>>> JIRA. You could gain additional feedbacks. One downside of HDFS RAID
is
>>>>>> that the less replicas there is, the less read of the data for processing
>>>>>> will be 'efficient/fast'. Reducing the number of replicas also diminishes
>>>>>> the number of supported node failures. I wouldn't say it's an easy
ride.
>>>>>>
>>>>>> Bertrand Dechoux
>>>>>>
>>>>>>
>>>>>> On Mon, Jul 21, 2014 at 1:29 PM, Zesheng Wu <wuzesheng86@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> We want to implement a RAID on top of HDFS, something like facebook
>>>>>>> implemented as described in:
>>>>>>> https://code.facebook.com/posts/536638663113101/saving-capacity-with-hdfs-raid/
>>>>>>>
>>>>>>>
>>>>>>> 2014-07-21 17:19 GMT+08:00 Bertrand Dechoux <dechouxb@gmail.com>:
>>>>>>>
>>>>>>> You want to implement a RAID on top of HDFS or use HDFS on top
of
>>>>>>>> RAID? I am not sure I understand any of these use cases.
HDFS handles for
>>>>>>>> you replication and error detection. Fine tuning the cluster
wouldn't be
>>>>>>>> the easier solution?
>>>>>>>>
>>>>>>>> Bertrand Dechoux
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Jul 21, 2014 at 7:25 AM, Zesheng Wu <wuzesheng86@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks for reply, Arpit.
>>>>>>>>> Yes, we need to do this regularly. The original requirement
of
>>>>>>>>> this is that we want to do RAID(which is based reed-solomon
erasure codes)
>>>>>>>>> on our HDFS cluster. When a block is corrupted or missing,
the downgrade
>>>>>>>>> read needs quick recovery of the block. We are considering
how to recovery
>>>>>>>>> the corrupted/missing block quickly.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2014-07-19 5:18 GMT+08:00 Arpit Agarwal <aagarwal@hortonworks.com>
>>>>>>>>> :
>>>>>>>>>
>>>>>>>>>> IMHO this is a spectacularly bad idea. Is it a one
off event? Why
>>>>>>>>>> not just take the perf hit and recreate the file?
>>>>>>>>>>
>>>>>>>>>> If you need to do this regularly you should consider
a mutable
>>>>>>>>>> file store like HBase. If you start modifying blocks
from under HDFS you
>>>>>>>>>> open up all sorts of consistency issues.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Jul 18, 2014 at 2:10 PM, Shumin Guo <gsmsteve@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> That will break the consistency of the file system,
but it
>>>>>>>>>>> doesn't hurt to try.
>>>>>>>>>>>  On Jul 17, 2014 8:48 PM, "Zesheng Wu" <wuzesheng86@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> How about write a new block with new checksum
file, and replace
>>>>>>>>>>>> the old block file and checksum file both?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 2014-07-17 19:34 GMT+08:00 Wellington Chevreuil
<
>>>>>>>>>>>> wellington.chevreuil@gmail.com>:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> there's no way to do that, as HDFS does
not provide file
>>>>>>>>>>>>> updates features. You'll need to write
a new file with the changes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Notice that even if you manage to find
the physical block
>>>>>>>>>>>>> replica files on the disk, corresponding
to the part of the file you want
>>>>>>>>>>>>> to change, you can't simply update it
manually, as this would give a
>>>>>>>>>>>>> different checksum, making HDFS mark
such blocks as corrupt.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Wellington.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 17 Jul 2014, at 10:50, Zesheng Wu
<wuzesheng86@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> > Hi guys,
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > I recently encounter a scenario
which needs to replace an
>>>>>>>>>>>>> exist block with a newly written block
>>>>>>>>>>>>> > The most straightforward way to
finish may be like this:
>>>>>>>>>>>>> > Suppose the original file is A,
and we write a new file B
>>>>>>>>>>>>> which is composed by the new data blocks,
then we merge A and B to C which
>>>>>>>>>>>>> is the file we wanted
>>>>>>>>>>>>> > The obvious shortcoming of this
method is wasting of network
>>>>>>>>>>>>> bandwidth
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > I'm wondering whether there is a
way to replace the old
>>>>>>>>>>>>> block by the new block directly.
>>>>>>>>>>>>> > Any thoughts?
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > --
>>>>>>>>>>>>> > Best Wishes!
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Yours, Zesheng
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Best Wishes!
>>>>>>>>>>>>
>>>>>>>>>>>> Yours, Zesheng
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> CONFIDENTIALITY NOTICE
>>>>>>>>>> NOTICE: This message is intended for the use of the
individual or
>>>>>>>>>> entity to which it is addressed and may contain information
that is
>>>>>>>>>> confidential, privileged and exempt from disclosure
under applicable law.
>>>>>>>>>> If the reader of this message is not the intended
recipient, you are hereby
>>>>>>>>>> notified that any printing, copying, dissemination,
distribution,
>>>>>>>>>> disclosure or forwarding of this communication is
strictly prohibited. If
>>>>>>>>>> you have received this communication in error, please
contact the sender
>>>>>>>>>> immediately and delete it from your system. Thank
You.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Best Wishes!
>>>>>>>>>
>>>>>>>>> Yours, Zesheng
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Best Wishes!
>>>>>>>
>>>>>>> Yours, Zesheng
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best Wishes!
>>>>>
>>>>> Yours, Zesheng
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Best Wishes!
>>>>
>>>> Yours, Zesheng
>>>>
>>>
>>>
>>
>
>
> --
> Best Wishes!
>
> Yours, Zesheng
>



-- 
Best Wishes!

Yours, Zesheng

Mime
View raw message