hadoop-hdfs-user mailing list archives

From Anu Engineer <aengin...@hortonworks.com>
Subject Re: HDFS backup to S3
Date Wed, 15 Jun 2016 15:59:40 GMT
Hi Max,

Unfortunately, we don’t have a better solution at the moment. I am wondering if the right
approach might be to use user-defined metadata (http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html)
and put that information along with the object that we are backing up.

However, that would be a code change in DistCp, and not as easy as a script. But that would
address the scalability issue that you are worried about.
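
As a rough illustration of the idea (a sketch only -- the bucket name, the metadata keys, and the example file/values below are made up, not anything DistCp does today), the AWS CLI can attach user-defined metadata to an object at upload time:

    # store the HDFS owner/group/permission as x-amz-meta-* headers on the backed-up object
    aws s3 cp /tmp/part-00000 s3://my-hdfs-backup/data/part-00000 \
        --metadata hdfs-owner=hdfs,hdfs-group=hadoop,hdfs-perm=644

A restore job could later read that metadata back with "aws s3api head-object" and re-apply it with hdfs dfs -chown / -chmod.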

Thanks
Anu



From: max scalf <oracle.blog3@gmail.com>
Date: Wednesday, June 15, 2016 at 7:15 AM
To: HDP mailing list <user@hadoop.apache.org>
Subject: HDFS backup to S3

Hello Hadoop community,

We are running Hadoop in AWS (not EMR), using the Hortonworks distro on EC2 instances.  Everything
is set up and working as expected.  Our design calls for running HDFS/data nodes on local/ephemeral
storage with 3x replication enabled by default, and all of the metastores (Hive, Oozie,
Ranger, Ambari, etc.) are external to the cluster using RDS/MySQL.

The question that I have is with regards to backups.  We want to run a nightly job that copies
data from HDFS into S3.  Knowing that our cluster lives in AWS, the obvious choice is to
run our backup to S3.  We do not want a warm backup (backing this cluster up to another cluster);
our RTO/RPO is 5 days for this cluster.  So we can run DistCp (something like the below link,
and the sketch after it) to back up our HDFS to S3, and we have tested this and it works just
fine, but how do we go about storing the ownership/permissions on these files?

http://www.nixguys.com/blog/backup-hadoop-hdfs-amazon-s3-shell-script
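
For reference, the copy itself is just a DistCp run against an s3a:// target, roughly along these lines (the bucket name and the key properties are placeholders for our setup; on EC2 an instance role may make the explicit keys unnecessary):

    hadoop distcp \
        -Dfs.s3a.access.key=<ACCESS_KEY> \
        -Dfs.s3a.secret.key=<SECRET_KEY> \
        -update \
        hdfs:///data s3a://my-hdfs-backup/data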

As S3 is a blob store and does not store any ownership/permissions, how do we go about backing
that up?  One of the ideas I had was to run hdfs dfs -lsr (and recursively get all files' and
folders' permissions/ownership), dump that into a file, and send that file over to S3 as
well.  I am guessing it will work now, but as the cluster grows it might not scale...

So I wanted to find out how people have managed backing up ownership/permissions of HDFS files/folders
when sending backups to a blob store like S3.

