hadoop-common-user mailing list archives

From Jeff Hammerbacher <ham...@cloudera.com>
Subject Re: Backing up HDFS?
Date Tue, 10 Feb 2009 01:47:42 GMT

There's also a ticket open to enable global snapshots for a single HDFS
instance: https://issues.apache.org/jira/browse/HADOOP-3637. While this
doesn't solve the multi-site backup issue, it does provide stronger
protection against programmatic deletion of data in a single cluster.


On Mon, Feb 9, 2009 at 5:22 PM, Allen Wittenauer <aw@yahoo-inc.com> wrote:

> On 2/9/09 4:41 PM, "Amandeep Khurana" <amansk@gmail.com> wrote:
> > Why would you want another backup beyond HDFS? HDFS itself
> > replicates your data, so the reliability of the system shouldn't be a
> > concern (if at all it is)...
> I'm reminded of a previous job where a site administrator refused to make
> tape backups (despite our continual harassment and pointing out that he was
> in violation of the contract) because he said RAID was "good enough".
> Then the RAID controller failed. When we couldn't recover the data "from
> the other mirror", he was fired.  Not sure how they ever recovered,
> especially considering what data it was they lost.  Hopefully they had a
> paper trail.
> To answer Nathan's question:
> > On Mon, Feb 9, 2009 at 4:17 PM, Nathan Marz <nathan@rapleaf.com> wrote:
> >
> >> How do people back up their data that they keep on HDFS? We have many TB
> of
> >> data which we need to get backed up but are unclear on how to do this
> >> efficiently/reliably.
> The content of our HDFSes is loaded from elsewhere and is not considered
> 'the source of authority'.  It is the responsibility of the original
> sources
> to maintain backups and we then follow their policies for data retention.
> For user generated content, we provide *limited* (read: quota'ed) NFS space
> that is backed up regularly.
> Another strategy we take is multiple grids in multiple locations that get
> the data loaded simultaneously.
> The key here is to prioritize your data.  Impossible-to-replicate data
> gets backed up by whatever means necessary; hard-to-regenerate data is the
> next priority; data that is easy to regenerate and OK to nuke doesn't get
> backed up.
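A related approach to the multi-grid strategy Allen describes is to copy data between clusters after the fact with `hadoop distcp`, which runs the copy as a MapReduce job and so scales to many terabytes. A minimal sketch; the NameNode hostnames, ports, and paths below are hypothetical placeholders, not anything from the thread:

```shell
# Copy a directory tree from the primary cluster to a backup cluster.
# "primary-nn", "backup-nn", and the paths are example values.
hadoop distcp \
    hdfs://primary-nn:8020/data/events \
    hdfs://backup-nn:8020/backups/data/events

# For periodic backups, -update skips files that already exist
# unchanged on the destination, copying only what differs.
hadoop distcp -update \
    hdfs://primary-nn:8020/data/events \
    hdfs://backup-nn:8020/backups/data/events
```

Note that distcp is a copy, not a snapshot: a programmatic delete replicated to the backup cluster before you notice it still loses the data, which is why the HADOOP-3637 snapshot work mentioned above is complementary.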
