hadoop-mapreduce-user mailing list archives

From max scalf <oracle.bl...@gmail.com>
Subject Re: HDFS backup to S3
Date Wed, 15 Jun 2016 21:48:13 GMT
Hi Anu,

Thanks for the information. The link you provided does not work.

@Hari,

Let me do some quick research on what you guys can provide and get back to
you.

On Wed, Jun 15, 2016, 10:59 AM Anu Engineer <aengineer@hortonworks.com>
wrote:

> Hi Max,
>
>
>
> Unfortunately, we don’t have a better solution at the moment. I am
> wondering if the right approach might be to use user-defined metadata (
> http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html) and
> put that information along with the object that we are backing up.
>
>
>
> However, that would be a code change in DistCp, and not as easy as a
> script. But that would address the scalability issue that you are worried
> about.
>
>
>
> Thanks
>
> Anu
>
>
>
>
>
>
>
> *From: *max scalf <oracle.blog3@gmail.com>
> *Date: *Wednesday, June 15, 2016 at 7:15 AM
> *To: *HDP mailing list <user@hadoop.apache.org>
> *Subject: *HDFS backup to S3
>
>
>
> Hello Hadoop community,
>
>
>
> We are running Hadoop in AWS (not EMR), using the Hortonworks distro on EC2
> instances.  Everything is set up and working as expected.  Our design
> calls for running HDFS/data nodes on local/ephemeral storage, and we have 3X
> replication enabled by default.  All of the metastores (Hive, Oozie, Ranger,
> Ambari, etc.) are external to the cluster, using RDS/MySQL.
>
>
>
> The question that I have is with regard to backups.  We want to run a
> nightly job that copies data from HDFS into S3.  Knowing that our cluster
> lives in AWS, the obvious choice is to run our backup to S3.  We do not
> want a warm backup (backing up this cluster to another cluster); our RTO/RPO
> is 5 days for this cluster.  So we can run distcp (something like the link
> below) to back up our HDFS to S3; we have tested this and it works just fine.
> But how do we go about storing the ownership/permissions of these files?
>
>
>
> http://www.nixguys.com/blog/backup-hadoop-hdfs-amazon-s3-shell-script
>
>
>
> As S3 is blob storage and does not store any ownership/permissions, how
> do we go about backing that up?  One of the ideas I had was to run hdfs dfs
> -lsr (to recursively get all file and folder permissions/ownership),
> dump that into a file, and send that file over to S3 as well.  I am
> guessing it will work for now, but as the cluster grows it might not scale...
>
>
>
> So I wanted to find out how people manage backing up the
> ownership/permissions of HDFS files/folders when sending backups to blob
> storage like S3.
>
>
>
>
>
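
The listing idea from the original message (dump a recursive HDFS listing to a file and ship that file to S3 with the data) could be sketched roughly as below. This is only an illustration, not a tested backup tool. It assumes the usual eight-column output of `hdfs dfs -ls -R` (the non-deprecated form of `-lsr`): permissions, replication, owner, group, size, date, time, path. The function names and the JSON-lines manifest layout are made up for this sketch.

```python
import json


def parse_ls_line(line):
    """Parse one line of `hdfs dfs -ls -R` output; return a dict or None."""
    line = line.rstrip("\r\n")
    # Split into at most 8 fields so that paths containing spaces stay intact.
    parts = line.split(None, 7)
    # Skip blank lines and headers such as "Found 3 items".
    if len(parts) != 8 or parts[0][0] not in "-dl":
        return None
    perms, _replication, owner, group, _size, _date, _time, path = parts
    return {"path": path, "owner": owner, "group": group, "perms": perms}


def manifest_lines(ls_output_lines):
    """Yield one JSON document per file or directory (hypothetical format)."""
    for line in ls_output_lines:
        record = parse_ls_line(line)
        if record is not None:
            yield json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    sample = [
        "Found 2 items",
        "drwxr-xr-x   - hdfs hadoop          0 2016-06-15 10:00 /data",
        "-rw-r--r--   3 max  hadoop       1234 2016-06-15 10:01 /data/my file.txt",
    ]
    for entry in manifest_lines(sample):
        print(entry)
```

Because the generator handles one line at a time, the listing can be piped through it without holding the whole namespace in memory. The listing itself still has to walk every path, so the scaling concern raised in the thread remains.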

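Anu's suggestion of carrying ownership/permissions as S3 user-defined metadata could look something like the sketch below. It only builds the `x-amz-meta-*` headers that would accompany each uploaded object; it does not touch DistCp, which is where the real change would have to happen. The key names (`hdfs-owner`, `hdfs-group`, `hdfs-perms`) and the function name are assumptions for illustration. The size check reflects the 2 KB limit that S3 documents for user-defined metadata, and it counts the header prefix too, to stay on the conservative side.

```python
MAX_USER_METADATA_BYTES = 2048  # S3 caps user-defined metadata at 2 KB


def hdfs_attrs_to_s3_headers(owner, group, perms):
    """Map HDFS owner/group/permissions to S3 user-defined metadata headers.

    The metadata key names are hypothetical; choose whatever the restore
    side agrees on.
    """
    headers = {
        "x-amz-meta-hdfs-owner": owner,
        "x-amz-meta-hdfs-group": group,
        "x-amz-meta-hdfs-perms": perms,
    }
    # Conservative size check: count UTF-8 bytes of every key and value.
    size = sum(
        len(key.encode("utf-8")) + len(value.encode("utf-8"))
        for key, value in headers.items()
    )
    if size > MAX_USER_METADATA_BYTES:
        raise ValueError("user-defined metadata too large: %d bytes" % size)
    return headers


if __name__ == "__main__":
    for name, value in sorted(
        hdfs_attrs_to_s3_headers("max", "hadoop", "rw-r--r--").items()
    ):
        print(name + ": " + value)
```

On restore, the same three values would be read back from the object's metadata and applied with `hdfs dfs -chown` and `hdfs dfs -chmod`.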