hadoop-common-dev mailing list archives

From "Paul Sutter" <sut...@gmail.com>
Subject Re: Hadoop Distributed File System requirements on Wiki
Date Fri, 07 Jul 2006 15:51:26 GMT
thanks! comments below...

On 7/7/06, Konstantin Shvachko <shv@yahoo-inc.com> wrote:
> Paul Sutter wrote:
> > One more suggestion:  store a copy of the per-block metadata on the
> > datanode. ..
> Being able to reconstruct the system even if the checkpoint is lost
> forever is
> a nice feature to have. The "original file name" can be placed into the
> crc file (#15)...

Great place to put it, very nice.
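
Just to make that concrete, something like the record below is what I have in
mind for the extra per-block metadata kept next to each replica. All of the
field names are purely illustrative, not the actual crc-file format:

  // Hypothetical per-replica record stored alongside the block (e.g. in the
  // crc/meta file): enough to rebuild the namespace if the namenode image and
  // edits are ever lost. Field names are illustrative only.
  import java.io.DataOutput;
  import java.io.IOException;

  class BlockRecoveryRecord {
    long blockId;         // which block this replica holds
    String originalPath;  // full path of the file the block belongs to (#15)
    int blockIndex;       // position of the block within that file
    long blockLength;     // bytes of user data in the block
    long crc;             // checksum already kept in the crc file

    void write(DataOutput out) throws IOException {
      out.writeLong(blockId);
      out.writeUTF(originalPath);
      out.writeInt(blockIndex);
      out.writeLong(blockLength);
      out.writeLong(crc);
    }
  }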

> > *Recoverability and Availability Goals*
> >
> > You might want to consider adding recoverability and availability goals.
> This is an interesting observation. Ideally, we would like to save and
> replicate
> fs image file as soon as the edits file reaches a specific size, and we
> would like
> to make edits file updates transactional, with the file system locked
> for updates
> during the transaction. This would be the zero recoverability goal in
> your terms.
> Are we willing to weaken this requirement in favor of the performance?

Actually, it's OK for me if we lose even an hour of data on a namenode
crash, since I can just resubmit the recent jobs. Less loss is better,
but my suggestion would be to favor simplicity over absolute recovery
if that's the tradeoff. Others might feel differently about acceptable
levels of data loss.
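
For what it's worth, the checkpoint policy you describe could be as simple as
the sketch below. Every class and method name here is made up, just to pin
down the semantics (checkpoint when edits passes a size threshold, lock
updates, write and replicate the image, truncate edits):

  // Rough sketch of the checkpoint policy under discussion. None of these
  // classes exist in Hadoop; they stand in for the real namenode internals.
  class CheckpointPolicy {
    private static final long EDITS_SIZE_THRESHOLD = 64L * 1024 * 1024; // arbitrary

    void maybeCheckpoint(Namespace ns, EditLog edits, ImageStore[] replicas)
        throws java.io.IOException {
      if (edits.sizeInBytes() < EDITS_SIZE_THRESHOLD) {
        return;                    // edits still small, nothing to do
      }
      synchronized (ns) {          // lock out namespace updates for the transaction
        for (ImageStore store : replicas) {
          ns.saveImage(store);     // write the merged fsimage to each replica location
        }
        edits.truncate();          // the edits are now folded into the image
      }
    }

    // Hypothetical collaborators.
    interface Namespace  { void saveImage(ImageStore s) throws java.io.IOException; }
    interface EditLog    { long sizeInBytes(); void truncate() throws java.io.IOException; }
    interface ImageStore { }
  }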

> > Availability goals are probably less stringent than for most storage
> > systems
> > (dare I say that a few hours downtime is probably OK) Adding these
> > goals to
> > the document could be valuable for consensus and prioritization.
> If I understood you correctly, this goal is more related to a specific
> installation of
> the system rather than to the system itself as a software product.
> Or do you mean that the total time spent by the system on self-maintenance
> procedures like backups and checkpointing should not exceed 2 hours a day?
> In any case, I agree, high availability should be mentioned, probably in the
> "Feature requirements" section.

It's about features. Is namenode failover automatic or manual? If it's
manual, it takes time. And it should definitely be manual for now.
Seamless namenode failover done right is a lot of work.

With manual failover, what is the downtime when a namenode fails?
Well, I imagine that you'd want to take everything down, bring the
filesystem up in safe mode (nice feature!) on the new namenode, and do
some kind of fsck. And then, when you're comfortable that everything
is copacetic, that all your files are present, and that the filesystem
won't do a radical dereplication of every block when you make it
writable, you make it writable. (In fact, the secondary namenode might
always come up in safe mode until manually changed.)
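
In pseudo-Java, that sequence looks roughly like this. The admin interface
below is invented for illustration, not the actual Hadoop API; in practice an
operator would drive it by hand with the admin tools:

  // Sketch of the manual failover sequence described above.
  class ManualFailover {
    interface NameNodeAdmin {
      void startInSafeMode();         // bring the new namenode up read-only
      FsckReport runFsck();           // walk the namespace, check block reports
      void leaveSafeMode();           // only once the operator is satisfied
    }
    interface FsckReport {
      boolean allFilesPresent();
      boolean replicationLooksSane(); // i.e. no storm of (de)replication on startup
    }

    static void failover(NameNodeAdmin standby) {
      standby.startInSafeMode();
      FsckReport report = standby.runFsck();
      if (report.allFilesPresent() && report.replicationLooksSane()) {
        standby.leaveSafeMode();      // now, and only now, accept writes
      }
      // otherwise: stay in safe mode and let a human investigate
    }
  }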

How long does this take? Well, during this time the system is
unavailable. And if it fails at 2AM, you're probably not back up
before 10AM.

But that's OK. Better to be down for a few hours (manual failover) than
to have a complex system likely to break (seamless automatic failover).
> >> > *Backup Scheme*
> >> > We might want to start discussion of a backup scheme for HDFS,
> >> > especially
> >> > given all the courageous rewriting and feature-addition likely to
> >> > occur.
> >>...
> >
> > But as for covering my fears, I'll feel safer with key data backed up
> > in a filesystem that is not DFS, as pedestrian as that sounds. :)
> Frankly speaking I've never thought about a backup of a 10 PB storage
> system. How much space will that require? Isn't it easier just to increase
> the replication factor? Just a thought...

Increasing replication doesn't protect me against a filesystem bug.

I'm a nervous nelly on this one: file system revisions do scare me,
and I don't have a 10PB system. Let's say I have a 100TB system, and
that to get back into production I need only restore 5TB worth of
critical files. Then once I'm back in production I can gradually
restore the next 25TB and regenerate the rest.

It's feasible and probably prudent. It's not that I'm expecting data loss
bugs in new code. My concern is less about the likelihood of the
problem and more about its severity.
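
Concretely, the kind of backup I mean is nothing fancier than copying a short
list of critical paths out of DFS into some other filesystem with the generic
FileSystem API. The path list and destination below are made up, just a
sketch of the idea:

  // Copy a handful of critical paths out of DFS into a non-DFS filesystem.
  // The paths and the destination are illustrative only.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.FileUtil;
  import org.apache.hadoop.fs.Path;

  public class CriticalFileBackup {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem dfs = FileSystem.get(conf);         // the DFS being backed up
      FileSystem local = FileSystem.getLocal(conf);  // any non-DFS target will do

      // Hypothetical list of the few TB needed to get back into production.
      String[] critical = { "/jobs/config", "/data/dictionaries", "/data/indexes" };

      for (String p : critical) {
        // copy without deleting the source
        FileUtil.copy(dfs, new Path(p), local, new Path("/backup" + p), false, conf);
      }
    }
  }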

To back up a 10PB system, you would want to back it up to a second
10PB system located on the opposite coast. In fact, if this system is
important to your business, you must do this. And then there is the
question: do you stagger software updates on the two systems?

You might want to find someone from EMC or NetApp and get their
feedback on how software changes, QA, and beta testing are handled
(including timelines). Storage systems are really a risky type of code
to modify, for lots of reasons more apparent to the downstream
consumers than to developers. :)
