hadoop-common-user mailing list archives

From Eric Sammer <esam...@cloudera.com>
Subject Re: Backing up HDFS
Date Tue, 03 Aug 2010 14:12:29 GMT

For backing up HDFS you have 3 options. Two of them are application based
and one is tool based.

1. The distcp command will copy HDFS data in parallel between clusters. See
'hadoop distcp' for details.
2. Upon copying data into HDFS (on data ingestion / incoming ETL) you could
"fan out" the incoming data stream and send a copy to more than one cluster
and run the same processing in both places.
3. As part of the MR jobs that do any daily processing, you could write
"change log" style logs and ship them between clusters. This is similar to
what relational databases do and amounts to incremental log shipping and

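As a rough illustration of option 1, here is a minimal Python sketch that assembles a 'hadoop distcp' invocation for a scheduled copy between two clusters. The NameNode hosts, port, and paths are hypothetical placeholders, and the helper name build_distcp_command is made up for this example. It only builds and prints the command; actually running it assumes a machine with the Hadoop client installed and access to both clusters.

```python
import shlex


def build_distcp_command(src_uri, dst_uri, update=True):
    """Build the argv list for a 'hadoop distcp' copy from src_uri to dst_uri."""
    argv = ["hadoop", "distcp"]
    if update:
        # -update asks distcp to copy only files that are missing
        # or that differ at the destination, instead of everything.
        argv.append("-update")
    argv.extend([src_uri, dst_uri])
    return argv


if __name__ == "__main__":
    # Hypothetical NameNode addresses and paths; substitute your own.
    cmd = build_distcp_command(
        "hdfs://nn-primary:8020/data/events",
        "hdfs://nn-backup:8020/data/events",
    )
    print(shlex.join(cmd))
    # To execute for real, something like:
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```

Keeping the command construction separate from execution makes it easy to log or dry-run the copy from cron before pointing it at production data.
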
Of course, for all of these options, a second cluster of similar size to the
first is required. Options 2 and 3 require custom development. In practice,
people who need this level of protection normally use a combination of
techniques based on the processing semantics. Each has trade-offs.

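To give a feel for the custom development option 2 involves, here is a minimal sketch of the "fan out" idea in Python: each chunk of an incoming stream is written to every destination before the next chunk is read. The function name fan_out and the use of in-memory buffers are assumptions for illustration only. In a real ingest path the sinks would be writers for each cluster, and you would need to decide what happens when one cluster accepts a chunk and the other fails.

```python
import io


def fan_out(source, sinks, chunk_size=64 * 1024):
    """Copy a byte stream to every sink and return the number of bytes read."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break  # end of the incoming stream
        for sink in sinks:
            # Every sink sees the same bytes in the same order.
            sink.write(chunk)
        total += len(chunk)
    return total


if __name__ == "__main__":
    # In-memory stand-ins for the writers of two clusters (hypothetical).
    incoming = io.BytesIO(b"record-1\nrecord-2\n")
    primary, backup = io.BytesIO(), io.BytesIO()
    copied = fan_out(incoming, [primary, backup])
    print(copied, primary.getvalue() == backup.getvalue())
```

This sketch writes to the sinks one after another, so the slowest cluster sets the ingest rate. That is one of the trade-offs to weigh against copying after the fact with distcp.
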
All of that said, what you're protecting against here is permanent loss of a
data center and human error. Disk, rack, and node level failures are already
handled by HDFS when properly configured. You have to do the cost / benefit
analysis for yourself to decide if it's worth the time, effort, complexity,
and maintenance.

On Tue, Aug 3, 2010 at 9:54 AM, dan.paulus <dan.paulus@bronto.com> wrote:

> So I am administering a 10+ node hadoop cluster and everything is going
> swimmingly.  Unfortunately, some relatively critical data is now being
> stored on the cluster and I am being asked to create a backup solution for
> hadoop in case of catastrophic failure of the data center, the application
> creating data corruption, and ultimately my company wants that warm fuzzy
> feeling that only an offsite backup can provide.
> So does anyone else actually back up HDFS?  After a quick Google and forum
> search I found the following link that creates a full backup and then
> incremental backups. Does anyone use this or something similar?
> http://blog.rapleaf.com/dev/2009/06/05/backing-up-hadoops-hdfs/
> Thanks in advance.
> --
> View this message in context:
> http://old.nabble.com/Backing-up-HDFS-tp29335698p29335698.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.

Eric Sammer
twitter: esammer
data: www.cloudera.com
