incubator-hcatalog-user mailing list archives

From Timothy Potter <thelabd...@gmail.com>
Subject looking for advice on partition archiving
Date Sun, 23 Dec 2012 19:42:45 GMT
I'm looking for advice on how to implement a date-based partition scheme
where, every 30 days or so, we merge many smaller partitions into larger
ones. For example, our pipeline creates six 4-hour partitions each day.
After about 30 days, I'd like to combine each group of three into a single
12-hour partition, leaving two partitions per day.

Here's an example of our partitions:

.../datetime_partition=2012-12-20_0400
.../datetime_partition=2012-12-20_0800
.../datetime_partition=2012-12-20_1200
.../datetime_partition=2012-12-20_1600
.../datetime_partition=2012-12-21_0000

After the merge, I'd like to end up with two larger partitions, each
containing 12 hours of data instead of 4:

.../datetime_partition=2012-12-20_1200
.../datetime_partition=2012-12-21_0000

If I merge 0400, 0800, and 1200 into a 12-hour block, the partition label
should still be "2012-12-20_1200", but that conflicts with the existing
4-hour partition of the same name. Does anything exist in the HCatalog /
Hive world to help with partition archiving like this? I'd like a process
that doesn't impact running jobs that are still reading data from the
partitions being merged.
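
To make the grouping concrete, here is a minimal sketch of the label
mapping. It assumes each label marks the end of its window (inferred from
the example above, where 0400, 0800, and 1200 roll up into 1200), and the
names merged_label and plan_merges are hypothetical helpers, not part of
HCatalog or Hive.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Label format used in the partition paths above, e.g. "2012-12-20_0400".
FMT = "%Y-%m-%d_%H%M"


def merged_label(label, hours=12):
    """Return the label of the larger window that a small partition falls into.

    Assumption: a label names the END of its window, so the time is rounded
    up to the next multiple of `hours` counted from midnight.
    """
    t = datetime.strptime(label, FMT)
    midnight = t.replace(hour=0, minute=0)
    window = timedelta(hours=hours)
    # Ceiling division: number of whole windows needed to cover the elapsed time.
    n = -(-(t - midnight) // window)
    return (midnight + n * window).strftime(FMT)


def plan_merges(labels, hours=12):
    """Group small-partition labels by the merged label they would roll up into."""
    groups = defaultdict(list)
    for label in sorted(labels):
        groups[merged_label(label, hours)].append(label)
    return dict(groups)
```

For the five labels listed above, plan_merges puts 0400, 0800, and 1200
under "2012-12-20_1200", and 1600 plus "2012-12-21_0000" under
"2012-12-21_0000". The first group shows the conflict directly: the merged
label is also one of its own inputs.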

Of course, I can see how to do it by writing to another table, but that
would require UNION'ing across tables in my Pig scripts. I could also
create the merged partition in temp space in HDFS, clean out the existing
partition, and then write from temp back to HCatalog (which is what I'm
doing now).
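
As a rough illustration of the staging idea, the sketch below swaps a
staged directory into place with two renames instead of a delete and
rewrite. It runs against the local filesystem as a stand-in; whether the
same sequence is safe on HDFS with HCatalog metadata is an assumption I
have not verified, and swap_partition is a hypothetical helper name.

```python
import os


def swap_partition(live_dir, staged_dir):
    """Replace live_dir with staged_dir using two renames.

    Returns the path where the old partition data was moved, so it can be
    deleted later once no readers depend on it.

    Caveat: this is NOT atomic. Between the two renames, live_dir briefly
    does not exist, so a reader that lists it in that window will fail.
    Both directories must be on the same filesystem for os.rename to work.
    """
    backup_dir = live_dir + ".old"
    os.rename(live_dir, backup_dir)  # step 1: move the old partition aside
    os.rename(staged_dir, live_dir)  # step 2: move the merged data into place
    return backup_dir
```

The gap between the two renames is exactly the window an atomic replace
would need to close.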

I guess it boils down to needing an atomic "replace partition" operation
for when the larger merged partition is written.

Thanks.
Tim
