hadoop-common-dev mailing list archives

From Konstantin Shvachko <...@yahoo-inc.com>
Subject Re: Hadoop Distributed File System requirements on Wiki
Date Fri, 07 Jul 2006 19:32:08 GMT
Paul Sutter wrote:

> On 7/7/06, Konstantin Shvachko <shv@yahoo-inc.com> wrote:
>> > *Recoverability and Availability Goals*
>> >
>> > You might want to consider adding recoverability and availability
>> > goals.
>> This is an interesting observation. Ideally, we would like to save and
>> replicate fs image file as soon as the edits file reaches a specific
>> size, and we would like to make edits file updates transactional, with
>> the file system locked for updates during the transaction. This would be
>> the zero recoverability goal in your terms.
>> Are we willing to weaken this requirement in favor of the performance?
> Actually it's OK for me if we lose even an hour of data on a namenode
> crash, since I can just resubmit the recent jobs. Less loss is better,
> but my suggestion would be to favor simplicity over absolute recovery
> if that's a tradeoff. Others might feel differently about acceptable
> levels of data loss.

I agree, simplicity is also very important.
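
To make the "zero recoverability" idea above concrete, here is a rough
sketch of a size-triggered checkpoint with updates serialized under a lock.
This is not Hadoop code: the class name EditLog, the threshold parameter,
and the in-memory lists standing in for the edits and image files are all
assumptions made for illustration.

```python
import threading


class EditLog:
    """Toy model: edits accumulate until a size threshold forces a checkpoint."""

    def __init__(self, checkpoint_threshold_bytes):
        # The threshold is a hypothetical tuning knob, not a real Hadoop setting.
        self.threshold = checkpoint_threshold_bytes
        self.edits = []        # stands in for the edits file
        self.edits_size = 0    # bytes currently held in the edits file
        self.image = []        # stands in for the saved fs image
        self.checkpoints = 0
        self._lock = threading.Lock()

    def append(self, record: bytes):
        # The lock models "file system locked for updates during the
        # transaction": one update (and any checkpoint it triggers) at a time.
        with self._lock:
            self.edits.append(record)
            self.edits_size += len(record)
            if self.edits_size >= self.threshold:
                self._checkpoint()

    def _checkpoint(self):
        # Fold the pending edits into the image and start a fresh edits file.
        # A real system would also replicate the new image at this point.
        self.image.extend(self.edits)
        self.edits = []
        self.edits_size = 0
        self.checkpoints += 1
```

A smaller threshold bounds how much would have to be replayed or lost after
a crash, at the cost of more frequent checkpoints; that is the performance
tradeoff in question.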

>> > Availability goals are probably less stringent than for most storage
>> > systems (dare I say that a few hours downtime is probably OK) Adding
>> > these goals to the document could be valuable for consensus and
>> > prioritization.
>> If I understood you correctly, this goal is more related to a specific
>> installation of the system rather than to the system itself as a
>> software product.
>> Or do you mean that the total time spent by the system on
>> self-maintenance procedures like backups and checkpointing should not
>> exceed 2 hours a day?
>> In any case, I agree, high availability should be mentioned, probably
>> in the "Feature requirements" section.
> It's about features. Is namenode failover automatic or manual? If it's
> manual, it takes time. And it should definitely be manual for now.
> Seamless namenode failover done right is a lot of work, and
> unnecessary.
> With manual failover, what is the downtime when a namenode fails?
> Well, I imagine that you'd want to take everything down, bring the
> filesystem up in safe mode (nice feature!) on the new namenode, and do
> some kind of fscheck. And then, when you're comfortable that
> everything is copacetic, all your files are present, and that the
> filesystem won't do a radical dereplication of every block when you
> make it writable, you make it writable. (In fact, the secondary
> namenode might always come up in safe mode until manually changed).
> How long does this take? Well, during this time the system is
> unavailable. And if it fails at 2AM, you're probably not back up
> before 10AM.
> But that's OK. Better to be down for a few hours (manual failover) than
> to have a complex system likely to break (seamless automatic
> failover).

That's a good point. We should probably add a task to define/describe
manual failover procedures and to evaluate the availability goal that we
can reasonably guarantee.
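
As a starting point for describing the procedure, the manual steps quoted
above (come up in safe mode, run a check, then have an operator make the
filesystem writable) could be modeled roughly as below. This is only a
sketch: StandbyNamenode, record_check, and the max_delete_fraction limit
are hypothetical names and values, not existing Hadoop APIs or settings.

```python
from enum import Enum


class Mode(Enum):
    DOWN = "down"
    SAFE = "safe"          # modeled here as read-only
    WRITABLE = "writable"


class StandbyNamenode:
    """Hypothetical model of a manually promoted namenode."""

    def __init__(self, max_delete_fraction=0.01):
        # Assumed operator-chosen limit on how many block replicas may be
        # scheduled for deletion before going writable is considered unsafe.
        self.mode = Mode.DOWN
        self.max_delete_fraction = max_delete_fraction
        self.check_passed = False

    def start(self):
        # Always come up in safe mode; nothing becomes writable automatically.
        self.mode = Mode.SAFE
        self.check_passed = False

    def record_check(self, missing_files, blocks_to_delete, total_blocks):
        # Passes only if every file is present and going writable would not
        # trigger a mass deletion of block replicas.
        fraction = blocks_to_delete / total_blocks if total_blocks else 0.0
        self.check_passed = (missing_files == 0
                             and fraction <= self.max_delete_fraction)
        return self.check_passed

    def make_writable(self, operator_confirmed):
        # The final step is explicit and manual by design.
        if self.mode is not Mode.SAFE:
            raise RuntimeError("namenode must be in safe mode first")
        if not (self.check_passed and operator_confirmed):
            raise RuntimeError("check not passed or operator has not confirmed")
        self.mode = Mode.WRITABLE
```

The downtime to evaluate is then the time spent between start() and a
successful make_writable(), most of which is waiting for a person.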

>> >> > *Backup Scheme*
>> >> > We might want to start discussion of a backup scheme for HDFS,
>> >> > especially given all the courageous rewriting and feature-addition
>> >> > likely to occur.
>> >>...
>> >
>> > But as for covering my fears, I'll feel safer with key data backed up
>> > in a filesystem that is not DFS, as pedestrian as that sounds. :)
>> Frankly speaking I've never thought about a backup of a 10 PB storage
>> system. How much space will that require? Isn't it easier just to
>> increase the replication factor? Just a thought...
> Increasing replication doesn't protect me against a filesystem bug.
> I'm a nervous nelly on this one: file system revisions do scare me,
> and I don't have a 10PB system. Let's say I have a 100TB system, and
> that to get back into production I need only restore 5TB worth of
> critical files. Then once I'm back in production I can gradually
> restore the next 25TB and regenerate the rest.
> It's feasible and probably prudent. It's not that I'm expecting data loss
> bugs in new code. My concern is less about the likelihood of the
> problem, and more about the severity of the problem.
> To back up a 10PB system, you would want to back it up to a second
> 10PB system located on an opposite coast. In fact if this system is
> important to your business, you must do this. And then there is the
> question, do you stagger software updates on these two systems?
> Probably.
> You might want to find someone from EMC or Netapp, and get their
> feedback on how software changes, QA, and beta testing are handled
> (including timelines). Storage systems are really a risky type of code
> to modify, for lots of reasons more apparent to the downstream
> consumers than to developers. :)

I guess if we want to separate the backup from the original storage
on the hardware level we have two options:
a) mirror data to another dfs cluster (earlier version, opposite coast)
b) copy critical data to a different (local) fs
If only 5% of the whole data set is critical you might want to go with (b).
This can be a separate (dfs based) application or an extension to dfs.
If ~100% is critical then (a) is the only way.
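
For the 100TB example quoted above, the sizing arithmetic behind option (b)
can be worked through in a few lines. This is just an illustration; the
function restore_plan and its field names are made up here.

```python
def restore_plan(total_tb, critical_tb, secondary_tb):
    """Split a data set into restore-first, restore-later, and regenerate."""
    if critical_tb + secondary_tb > total_tb:
        raise ValueError("backed-up data cannot exceed the total data set")
    return {
        # Share of the data that must be restored before going back into
        # production; this is what has to fit on the separate (local) fs.
        "critical_fraction": critical_tb / total_tb,
        "restore_first_tb": critical_tb,
        "restore_later_tb": secondary_tb,
        # Everything else is assumed to be recomputable from the rest.
        "regenerate_tb": total_tb - critical_tb - secondary_tb,
    }


# Numbers from the quoted example: 100TB total, 5TB critical, next 25TB.
plan = restore_plan(100, 5, 25)
```

With those numbers the critical fraction is 5%, and 70TB is regenerated
rather than backed up at all.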
On a related issue, do we want to add the upgrade procedures task to the 

