hadoop-hdfs-user mailing list archives

From "Thomas Palka (TripAdvisor)" <tpa...@tripadvisor.com>
Subject Hadoop hdfs / Hive backup solution, open-sourced by TripAdvisor
Date Tue, 15 May 2012 20:50:01 GMT
At TripAdvisor we use Hadoop and Hive for our warehousing needs. Processing the daily logs
takes a long time, and re-processing them would be prohibitive.  As we couldn't find a backup
solution, we put one together.  We open sourced it in hopes that it might be useful to others
as well.  You can find it on github:

https://github.com/TAwarehouse/backup-hadoop-and-hive

The backup app traverses the hdfs filesystem looking for all files with mtime in a given range,
then copies (a la copyToLocal) the files to a local directory.  If hdfs were to crash, you
can use "hadoop fs -copyFromLocal" to restore the filesystem contents.  The backup can be
invoked incrementally to keep updating the local copy.  Files that would be overwritten get
copied to a "preserved" area, so that older versions remain available.
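To make the incremental behaviour concrete, here is a minimal sketch of the same idea in Python. It is not the project's code: it walks an ordinary local directory as a stand-in for hdfs, and the function name backup_tree and the timestamped layout of the "preserved" area are assumptions made for illustration only.

```python
import os
import shutil
import time


def backup_tree(src_root, dest_root, preserved_root, mtime_start, mtime_end):
    """Copy files under src_root whose mtime falls in [mtime_start, mtime_end).

    If a destination file already exists, it is first moved into
    preserved_root so the older version stays available.
    Returns the list of relative paths that were copied.
    """
    copied = []
    # Assumed layout: one preserved sub-directory per backup run.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in sorted(filenames):
            src = os.path.join(dirpath, name)
            mtime = os.path.getmtime(src)
            if not (mtime_start <= mtime < mtime_end):
                continue  # outside the requested mtime range
            rel = os.path.relpath(src, src_root)
            dest = os.path.join(dest_root, rel)
            if os.path.exists(dest):
                # Keep the version that is about to be overwritten.
                kept = os.path.join(preserved_root, stamp, rel)
                os.makedirs(os.path.dirname(kept), exist_ok=True)
                shutil.move(dest, kept)
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            shutil.copy2(src, dest)
            copied.append(rel)
    return copied
```

Running it repeatedly with consecutive mtime ranges (for example, one range per day) keeps the local copy up to date, which is the incremental mode described above.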

This project also includes a dump of the hive schema, along with hql statements to reassociate
the tables with hdfs partitions.  This portion came in very handy when we migrated our Hive
backing db from derby to mysql.
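For readers unfamiliar with what "reassociating" looks like, the sketch below generates the kind of hql involved: one ALTER TABLE ... ADD PARTITION statement per partition directory. It is only an illustration, not the project's dump format; the helper name add_partition_statements, the table daily_logs, the partition column ds, and the paths are all hypothetical, and values are not escaped.

```python
def add_partition_statements(table, partitions):
    """Build one ALTER TABLE ... ADD PARTITION statement per partition.

    partitions is a list of (spec, location) pairs, where spec maps a
    partition column to its value and location is the hdfs directory.
    Values are inserted as-is (no quoting or escaping is attempted).
    """
    statements = []
    for spec, location in partitions:
        cols = ", ".join("{}='{}'".format(k, v) for k, v in spec.items())
        statements.append(
            "ALTER TABLE {} ADD IF NOT EXISTS PARTITION ({}) LOCATION '{}';".format(
                table, cols, location
            )
        )
    return statements


# Hypothetical table, partition column, and path, for illustration only.
for stmt in add_partition_statements(
    "daily_logs",
    [({"ds": "2012-05-15"}, "/warehouse/daily_logs/ds=2012-05-15")],
):
    print(stmt)
```

Replaying statements like these against a freshly created table is one way to point Hive back at partition data that already exists in hdfs.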

Thanks go to Josh Patterson, Edward Capriolo, and Rapleaf for letting us use their hdfs-style
checksum, hive show-create-table, and hdfs traversal code.

For more info see the README:

https://github.com/TAwarehouse/backup-hadoop-and-hive/blob/master/README.txt


tom.
tpalka<at>tripadvisor<dot>com

