On Aug 3, 2010, at 9:12 AM, Eric Sammer wrote:
All of that said, what you're protecting against here is permanent loss of a
data center and human error. Disk, rack, and node level failures are already
handled by HDFS when properly configured.
You've forgotten a third cause of loss: undiscovered software bugs.
The downside of spinning disks is that one completely fatal bug can destroy all your data in about a minute. (At my site, I famously deleted about 100 TB in 10 minutes with a scratch-space cleanup script gone awry. That was one nasty bug.) This is why we keep good backups.
If you're very, very serious about archiving and have a huge budget, you might invest a few million in a tape silo at multiple sites, flip the write-protect tab on the tapes, eject them, and send them off to secure facilities. This isn't for everyone, though :)