hadoop-common-dev mailing list archives

From Konstantin Shvachko <...@yahoo-inc.com>
Subject Re: Hadoop Distributed File System requirements on Wiki
Date Fri, 07 Jul 2006 07:29:58 GMT
Hi Paul,

Paul Sutter wrote:

> Eric,
>
> Thanks - response embedded below.
>
> One more suggestion: store a copy of the per-block metadata on the
> datanode. It doesn't have to have an updated copy of the filename; just
> the "original file name" and block offset would be fine. Since you're
> adding truncation features, you'd want some kind of truncation
> generation number too. This would make a distributed namenode recovery
> possible, which is belt-and-suspenders valuable even after adding
> checkpointing features to the namenode. Storing this metadata is more
> important than writing the recovery program, since the recovery
> program could be written after the disaster that makes it necessary.
> (Just a suggestion.)

Being able to reconstruct the system even if the checkpoint is lost forever
is a nice feature to have. The "original file name" can be placed into the
crc file (#15) related to the given block. The offset, or the block sequence
number to be precise, can be a part of the block id, followed by a random
number. Generation numbers will be required as soon as we start implementing
concurrent appends and truncates. They can also be encoded into the local
file name representing the block, say,
<block-id>.<generation#>
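To make the naming concrete, here is a minimal sketch of how a datanode
might encode and parse such file names. The class and method names are
hypothetical, just for illustration:

    // Hypothetical helper for the <block-id>.<generation#> scheme.
    class BlockFileName {
        // e.g. encode("blk_4711", 3) -> "blk_4711.3"
        static String encode(String blockId, long generation) {
            return blockId + "." + generation;
        }
        static String blockIdOf(String fileName) {
            return fileName.substring(0, fileName.lastIndexOf('.'));
        }
        static long generationOf(String fileName) {
            return Long.parseLong(
                fileName.substring(fileName.lastIndexOf('.') + 1));
        }
    }

A datanode scanning its data directory could then keep only the file with
the highest generation number for each block id.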

> On 7/6/06, Eric Baldeschwieler <eric14@yahoo-inc.com> wrote:
>
>>
>> On Jul 6, 2006, at 12:02 PM, Paul Sutter wrote:
>>
>> ...
>> > *Constant size file blocks (#16), -1*
>> >
>> > I vote to keep variable size blocks, especially because you are adding
>> > atomic append capabilities (#25). Variable length blocks create the
>> > possibility for blocks that contain only whole records. This:
>> > - improves recoverability for large important files with one or more
>> > irrevocably lost blocks, and
>> > - makes it very clean for mappers to process local data blocks
>>
>> ...  I think we can achieve our goal without compromising yours.
>> Each block can be of any size up to the file's fixed block size.  The
>> system can be aware of that and provide an API to report gaps and/or
>> an API option to skip them or see them as NULLs.  This reporting can
>> be done at the datanode level, allowing us to remove all the size data
>> & logic at the namenode level.
>>
>> ** If you agree, why don't we just add the above annotation to
>> Konstantin's doc?
>
>
> Wow! Good idea, and now I see why you wanted to make the change in the
> first place. I agree, please go ahead and add.
>
> Incidentally, it's probably fine if
> - the API just skips the ghost bytes,
> - programs using such files only ever seek to locations that have been
> returned by getPos(), and
> - getPos() returns the byte offset of the next block as soon as a ghost
> byte is reached.
>
> I think existing programs will work fine within these restrictions.
> The last one is intended for code like SequenceFile that checks the
> current position against the file length when reading data.
> (SequenceFile's syncing code might have to be reconsidered, but it would
> be easier, since you'd just advance to the next block on a checksum
> failure.)

Yes, the intention was not to make the blocks literally of constant size,
but to let datanodes deal with incomplete blocks while the namenode treats
them all equally. I'll add these clarifying ideas to the document.
The good thing about a Wiki is that it's editable :-)
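To pin down the semantics discussed above, here is one way the gap-aware
read contract could be written down. This is purely illustrative; none of
these names exist in Hadoop:

    import java.io.IOException;

    // Illustrative sketch of the ghost-byte contract discussed above.
    interface GapAwareInput {
        // Reads up to len bytes, transparently skipping ghost (unwritten)
        // bytes; returns -1 at end of file.
        int read(byte[] buf, int off, int len) throws IOException;

        // When a ghost region is reached, returns the byte offset of the
        // start of the next block, so callers like SequenceFile can
        // compare it against the file length.
        long getPos() throws IOException;

        // Seeking is only defined for positions previously returned
        // by getPos().
        void seek(long pos) throws IOException;
    }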

> *Recoverability and Availability Goals*
>
> You might want to consider adding recoverability and availability goals.
> Recoverability goals might include the amount of data lost in case of a
> namenode failure (today it's about a year's worth, but it could be
> day-hour-minute-second-zero at varying costs). If we have a statistically
> inclined person on the project, we could estimate the acceptable block
> loss probabilities at scale.

This is an interesting observation. Ideally, we would like to save and
replicate the fs image file as soon as the edits file reaches a specific
size, and we would like to make edits file updates transactional, with the
file system locked for updates during the transaction. That would be the
zero recoverability goal in your terms.
Are we willing to weaken this requirement in favor of performance?
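As a strawman, the checkpoint trigger could be as simple as the sketch
below. All of the names are hypothetical, and the stubs stand in for the
real namenode operations:

    import java.io.File;
    import java.io.IOException;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Hypothetical sketch of the checkpoint policy described above.
    class CheckpointPolicy {
        private final ReentrantReadWriteLock namespaceLock =
            new ReentrantReadWriteLock();

        void maybeCheckpoint(File editsLog, long maxEditsBytes)
                throws IOException {
            if (editsLog.length() < maxEditsBytes) return;
            namespaceLock.writeLock().lock(); // lock out updates during the transaction
            try {
                rollEditsLog();   // start a fresh, empty edits file
                saveImage();      // write the fs image checkpoint
                replicateImage(); // copy the new image to backup locations
            } finally {
                namespaceLock.writeLock().unlock();
            }
        }

        // Stubs standing in for the real namenode operations.
        private void rollEditsLog() throws IOException {}
        private void saveImage() throws IOException {}
        private void replicateImage() throws IOException {}
    }

As for estimating block loss probabilities, a rough, assumption-laden
example: if each replica is independently lost with probability p during a
recovery window, a block with replication r is lost only when all r
replicas go, i.e. with probability about p^r. With p = 0.01 and r = 3 that
is 10^-6 per block, so a system with 10^8 blocks would still expect on the
order of 100 lost blocks. The real numbers depend entirely on failure
correlation and re-replication speed.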

> Availability goals are probably less stringent than for most storage
> systems (dare I say that a few hours' downtime is probably OK). Adding
> these goals to the document could be valuable for consensus and
> prioritization.

If I understood you correctly, this goal relates more to a specific
installation of the system than to the system itself as a software product.
Or do you mean that the total time spent by the system on self-maintenance
procedures like backups and checkpointing should not exceed two hours a day?
In any case, I agree: high availability should be mentioned, probably in
the "Feature requirements" section.

>> > *Backup Scheme*
>> >
>> > We might want to start a discussion of a backup scheme for HDFS,
>> > especially given all the courageous rewriting and feature addition
>> > likely to occur.
>>
>> ** I agree, this needs to be on the list.  I'm imagining a command
>> that hardlinks every datanode's (and namenode's if needed) files into
>> a snapshot directory.  And another command that moves all current
>> state into a snapshot directory and hardlinks a snapshot's state back
>> into the working directory.  This would be very fast and not cost
>> much space in the short term.  Thoughts?  (yes, hardlinks are a pain
>> on the PC, we can discuss design later)
>
> This is a fantastic idea.
>
> But as for covering my fears, I'll feel safer with key data backed up
> in a filesystem that is not DFS, as pedestrian as that sounds. :)

Frankly speaking, I've never thought about backing up a 10 PB storage
system. How much space would that require? Isn't it easier just to
increase the replication factor? Just a thought...
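That said, the hardlink snapshot Eric describes above is cheap to sketch.
Assuming a local filesystem that supports hard links, and a flat data
directory (a simplification), it could look like this; the class is
hypothetical:

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Illustration only: link every block file into a snapshot directory.
    // Fast, and costs no extra space until block files are replaced.
    class Snapshotter {
        static void snapshot(Path dataDir, Path snapshotDir)
                throws IOException {
            Files.createDirectories(snapshotDir);
            try (DirectoryStream<Path> blocks =
                    Files.newDirectoryStream(dataDir)) {
                for (Path block : blocks) {
                    if (Files.isRegularFile(block)) {
                        // A hard link shares the on-disk data with the original.
                        Files.createLink(
                            snapshotDir.resolve(block.getFileName()), block);
                    }
                }
            }
        }
    }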

>> > *Rebalancing (#22, #21)*
>> >
>> > I would suggest that keeping disk usage balanced is more than a
>> > performance feature; it's important for the success of running jobs
>> > with large map outputs or large sorts. Our most common reducer failure
>> > is running out of disk space during sort, and this is caused by
>> > imbalanced block allocation.
>>
>> ** Good point.  Any interest in helping us with this one?
>
> We'll take a look at it.

Very true.
Many features and performance tasks can be considered reliability tasks
from a certain point of view. We've also seen uneven distribution of data,
which led to task failures. But that is a sort of higher-order balancing -
between different subsystems competing for storage resources on the same
node.
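Just to make the placement idea concrete, a naive balance-aware allocator
might simply prefer the datanode with the most free space. A hypothetical
sketch, not the actual allocation code:

    import java.util.Comparator;
    import java.util.List;

    // Hypothetical: pick the block target with the most free space,
    // so allocation tends to even out disk usage across datanodes.
    class BalancedPlacement {
        static final class Node {
            final String name;
            final long freeBytes;
            Node(String name, long freeBytes) {
                this.name = name;
                this.freeBytes = freeBytes;
            }
        }

        static Node chooseTarget(List<Node> candidates) {
            return candidates.stream()
                .max(Comparator.comparingLong(n -> n.freeBytes))
                .orElseThrow(() -> new IllegalStateException("no datanodes"));
        }
    }

In practice you would want some randomization among the emptiest nodes to
avoid hot-spotting a single disk.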


>> > On 6/30/06, Konstantin Shvachko <shv@yahoo-inc.com> wrote:
>> >>
>> >> I've created a Wiki page that summarizes DFS requirements and
>> >> proposed
>> >> changes.
>> >> This is a summary of discussions held in this mailing list and
>> >> additional internal discussions.
>> >> The page is here:
>> >>
>> >> http://wiki.apache.org/lucene-hadoop/DFS_requirements
>> >>
>> >> I see there is an ongoing related discussion in HADOOP-337.
>> >> We prioritized our goals as
>> >> (1) Reliability (which includes Recoverability and Availability)
>> >> (2) Scalability
>> >> (3) Functionality
>> >> (4) Performance
>> >> (5) other
>> >> But then gave higher priority to some features like the append
>> >> functionality.
>> >>
>> >> Happy holidays to everybody.
>> >>
>> >> --Konstantin Shvachko
>> >>

