hadoop-common-user mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: Backing up HDFS
Date Tue, 03 Aug 2010 15:02:48 GMT
On Tue, Aug 3, 2010 at 10:42 AM, Brian Bockelman <bbockelm@cse.unl.edu> wrote:
> On Aug 3, 2010, at 9:12 AM, Eric Sammer wrote:
> <snip/>
> All of that said, what you're protecting against here is permanent loss of a
> data center and human error. Disk, rack, and node level failures are already
> handled by HDFS when properly configured.
> You've forgotten a third cause of loss: undiscovered software bugs.
> The downside of spinning disks is one completely fatal bug can destroy all
> your data in about a minute (at my site, I famously deleted about 100TB in
> 10 minutes with a scratch-space cleanup script gone awry.  That was one
> nasty bug).  This is why we keep good backups.
> If you're very, very serious about archiving and have a huge budget, you
> would invest a few million into a tape silo at multiple sites, flip the
> write-protection tab on the tapes, eject them, and send them off to secure
> facilities.  This isn't for everyone though :)
> Brian

Since HDFS filesystems are usually very large, backing them up is a
challenge in itself. This is a financial issue as well as a
technical one. A standard DataNode/TaskTracker machine might have
hardware like this:

8 1TB disks
4X quad core CPU

Assuming you are taking the distcp approach, you can mirror your
cluster with some scripting/coding. However, your destination systems
can be more modest, assuming you wish to use them ONLY for data no job

8 2TB disks
1x dual core (AMD for low power consumption)
2 GB RAM (if you can even find this little RAM on a server class machine)
single power supply
(whatever else you can strip off to save $)
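
As a rough illustration of the "some scripting/coding" around distcp
mentioned above, here is a minimal Python sketch. It only assembles and
(optionally) runs a `hadoop distcp -update` command. The NameNode
hostnames, port, and paths are made up for the example, and the helper
names `build_distcp_command` and `mirror` are hypothetical, not part of
Hadoop. It assumes the `hadoop` command is on the PATH of the machine
running the script.

```python
import subprocess


def build_distcp_command(src_uri, dst_uri, update=True):
    """Return the argv list for a `hadoop distcp` run from src_uri to dst_uri."""
    cmd = ["hadoop", "distcp"]
    if update:
        # -update skips files that already exist unchanged on the destination
        cmd.append("-update")
    cmd.extend([src_uri, dst_uri])
    return cmd


def mirror(src_uri, dst_uri, dry_run=True):
    """Print the distcp command (dry run) or execute it and return its exit code."""
    cmd = build_distcp_command(src_uri, dst_uri)
    if dry_run:
        print(" ".join(cmd))
        return 0
    return subprocess.call(cmd)


if __name__ == "__main__":
    # Hypothetical source and backup clusters; replace with real URIs.
    mirror("hdfs://prod-nn:8020/data", "hdfs://backup-nn:8020/data")
```

Run from cron, something like this gives a periodic mirror. Note that
the sketch deliberately does not pass distcp's `-delete` option, so a
file removed by accident on the source is not removed from the backup
on the next run.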
