hadoop-hdfs-user mailing list archives

From Anu Engineer <aengin...@hortonworks.com>
Subject Re: HDFS backup to S3
Date Wed, 15 Jun 2016 22:14:38 GMT
Sorry, my bad. The closing bracket was attached to the URL: http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html


From: max scalf <oracle.blog3@gmail.com>
Date: Wednesday, June 15, 2016 at 2:48 PM
To: Anu Engineer <aengineer@hortonworks.com>, HDP mailing list <user@hadoop.apache.org>
Subject: Re: HDFS backup to S3

Hi Anu,

Thanks for the information, but the link you provided does not work.


Let me do some quick research on what you guys can provide and get back to you.
On Wed, Jun 15, 2016, 10:59 AM Anu Engineer <aengineer@hortonworks.com> wrote:
Hi Max,

Unfortunately, we don’t have a better solution at the moment. I am wondering if the right
approach might be to use user-defined metadata (http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html)
and put that information along with the object that we are backing up.

However, that would be a code change in DistCp, and not as easy as a script. On the upside, it
would address the scalability issue that you are worried about.
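
To make that concrete, here is a rough sketch of the idea using the AWS SDK for Java and the
Hadoop FileSystem API (the bucket name, key prefix, and HDFS path below are placeholders, not
part of DistCp or any existing tool):

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.ObjectMetadata;
    import com.amazonaws.services.s3.model.PutObjectRequest;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BackupWithMetadata {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path src = new Path("/data/example.txt");  // hypothetical HDFS file
            FileStatus st = fs.getFileStatus(src);

            // User-defined metadata is stored with the object as
            // x-amz-meta-* headers and comes back on every GET/HEAD.
            ObjectMetadata meta = new ObjectMetadata();
            meta.setContentLength(st.getLen());
            meta.addUserMetadata("hdfs-owner", st.getOwner());
            meta.addUserMetadata("hdfs-group", st.getGroup());
            meta.addUserMetadata("hdfs-perms", st.getPermission().toString());

            AmazonS3 s3 = new AmazonS3Client();  // default credential chain
            s3.putObject(new PutObjectRequest(
                    "my-backup-bucket",                 // placeholder bucket
                    "backup" + src.toUri().getPath(),   // backup/data/example.txt
                    fs.open(src), meta));
        }
    }

On the restore side, a HEAD on each object would return those x-amz-meta-* values, which could
then be replayed with FileSystem.setOwner() and setPermission(); automating that pair of steps
is roughly what the DistCp change would amount to.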


From: max scalf <oracle.blog3@gmail.com>
Date: Wednesday, June 15, 2016 at 7:15 AM
To: HDP mailing list <user@hadoop.apache.org>
Subject: HDFS backup to S3

Hello Hadoop community,

We are running Hadoop in AWS (not EMR), using the Hortonworks distro on EC2 instances.
Everything is set up and working as expected. Our design calls for running HDFS/datanodes on
local/ephemeral storage with 3x replication enabled by default, and all of the metastores
(Hive, Oozie, Ranger, Ambari, etc.) are external to the cluster on RDS/MySQL.

The question I have is with regard to backups. We want to run a nightly job that copies data
from HDFS into S3. Since our cluster lives in AWS, the obvious choice is to back up to S3. We
do not want a warm backup (backing this cluster up to another cluster); our RTO/RPO for this
cluster is 5 days. So we can run distcp (something like the link below) to back up our HDFS
to S3, and we have tested this and it works just fine, but how do we go about storing the
ownership/permissions on these files?
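
For reference, here is roughly what that copy looks like when driven through the DistCp Java
API (Hadoop 2.x; the paths, bucket, and credential values below are placeholders):

    import java.util.Collections;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.tools.DistCp;
    import org.apache.hadoop.tools.DistCpOptions;

    public class NightlyBackup {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Credentials for the s3a connector; placeholder values only.
            conf.set("fs.s3a.access.key", "ACCESS_KEY");
            conf.set("fs.s3a.secret.key", "SECRET_KEY");

            // Copy /data from HDFS into the backup bucket.
            DistCpOptions options = new DistCpOptions(
                    Collections.singletonList(new Path("hdfs:///data")),
                    new Path("s3a://my-backup-bucket/nightly"));

            new DistCp(conf, options).execute();  // submits a MapReduce job
        }
    }

Note that nothing in this copy carries the HDFS ownership/permissions across, which is exactly
the gap I am asking about.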


As S3 is blob storage and does not store any ownership/permissions, how do we go about backing
those up? One of the ideas I had was to run hdfs dfs -lsr (recursively getting the
permissions/ownership of all files and folders), dump that into a file, and send that file
over to S3 as well. I am guessing that will work now, but as the cluster grows it might not
scale...
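
The dump I have in mind would look roughly like this with the FileSystem API (just a sketch;
it prints one manifest line per path, which we could then ship to S3 alongside the data):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PermissionDump {
        // Recursively print path, owner, group, and permission,
        // one tab-separated line per entry.
        static void dump(FileSystem fs, Path dir) throws Exception {
            for (FileStatus st : fs.listStatus(dir)) {
                System.out.println(st.getPath().toUri().getPath() + "\t"
                        + st.getOwner() + "\t"
                        + st.getGroup() + "\t"
                        + st.getPermission());
                if (st.isDirectory()) {
                    dump(fs, st.getPath());
                }
            }
        }

        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            dump(fs, new Path("/"));
        }
    }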

So I wanted to find out how people manage backing up the ownership/permissions of HDFS
files/folders when sending backups to a blob storage like S3.
