hadoop-hdfs-user mailing list archives

From max scalf <oracle.bl...@gmail.com>
Subject HDFS backup to S3
Date Wed, 15 Jun 2016 14:15:12 GMT
Hello Hadoop community,

We are running Hadoop in AWS (not EMR), using the Hortonworks distribution
on EC2 instances.  Everything is set up and working as expected.  Our design
calls for running the HDFS DataNodes on local/ephemeral storage with 3x
replication enabled by default.  All of the metastores (Hive, Oozie, Ranger,
Ambari, etc.) are external to the cluster, on RDS/MySQL.

My question is about backups.  We want to run a nightly job that copies
data from HDFS into S3.  Since our cluster lives in AWS, S3 is the obvious
backup target.  We do not want a warm backup (backing up this cluster to
another cluster); our RTO/RPO for this cluster is 5 days.  We can run
distcp (something like the link below) to back up HDFS to S3; we have
tested this and it works fine.  But how do we go about storing the
ownership and permissions of these files?


Since S3 is a blob store and does not store HDFS ownership or permissions,
how do we back those up?  One idea I had was to run hdfs dfs -lsr (to
recursively list the ownership and permissions of all files and folders),
dump the output into a file, and send that file to S3 as well.  I am
guessing this will work for now, but it might not scale as the cluster
grows.
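
A minimal Python sketch of the listing idea above, assuming the usual
eight-column output of hdfs dfs -ls -R (the non-deprecated form of -lsr) and
that the hdfs command is on the PATH.  The function names and the
tab-separated manifest format are hypothetical, chosen only for
illustration.  The listing is streamed line by line rather than held in
memory, but it still walks the whole namespace on every run.

```python
import subprocess


def parse_ls_line(line):
    """Parse one line of `hdfs dfs -ls -R` output; return None for other lines."""
    # Assumed columns: perms, replication, owner, group, size, date, time, path.
    # maxsplit=7 keeps spaces inside the path intact.
    parts = line.rstrip("\n").split(None, 7)
    if len(parts) < 8 or parts[0][0] not in "-dl":
        return None
    perms, _repl, owner, group, _size, _date, _time, path = parts
    return {"path": path, "perms": perms, "owner": owner, "group": group}


def symbolic_to_octal(perms):
    """Convert e.g. 'drwxr-xr-x' to '0755'; a trailing t/T sets the sticky bit."""
    bits = perms[1:10]  # drop the type character and any trailing ACL '+'
    mode = 0
    for ch in bits:
        # '-' and 'T' (sticky without execute) mean the bit is off.
        mode = (mode << 1) | int(ch not in "-T")
    sticky = 1 if bits[8] in "tT" else 0
    return "%d%03o" % (sticky, mode)


def dump_manifest(root, out_path):
    """Stream a recursive listing of `root` into a tab-separated manifest file."""
    cmd = ["hdfs", "dfs", "-ls", "-R", root]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc, \
            open(out_path, "w") as out:
        for line in proc.stdout:
            rec = parse_ls_line(line)
            if rec is None:
                continue  # skip headers such as "Found N items"
            mode = symbolic_to_octal(rec["perms"])
            # Note: paths containing tabs or newlines would break this format.
            out.write("\t".join([mode, rec["owner"], rec["group"], rec["path"]]) + "\n")
    if proc.returncode != 0:
        raise RuntimeError("hdfs dfs -ls -R exited with %d" % proc.returncode)
```

Calling dump_manifest("/", "hdfs_perms.tsv") would produce a file that can be
copied to S3 next to the data.  After a restore, each manifest row could be
replayed with hdfs dfs -chown owner:group path and hdfs dfs -chmod mode path.
This sketch records only the basic permission bits; extended ACLs, if used,
would have to be captured separately.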

So I wanted to find out how people manage backing up the ownership and
permissions of HDFS files and folders when backing up to a blob store
like S3.
