incubator-hcatalog-user mailing list archives

From Travis Crawford <>
Subject Re: looking for advice on partition archiving
Date Mon, 24 Dec 2012 01:00:33 GMT
Hey Tim -

Using an external table comes to mind. You could write the merged
partition to a new directory, remove metadata for "old" partitions
that have been merged, then register the new merged partition.

Existing queries would not be affected because the metastore is only
contacted for inputs when starting the job. Sometime later you could
garbage collect data files not referenced in the metastore. New
queries would of course use the new files.

This is similar to deleting files on an inode-based filesystem, except
with a time-based garbage collector in place of reference counting.
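
Concretely, the drop-then-register swap might look something like the
following Hive DDL. This is only a sketch: the table name "events" and
the paths are hypothetical, and it assumes the table is EXTERNAL so
dropping a partition removes only metadata, not the files.

```sql
-- 1. Write the merged 12-hour file set to a NEW directory, e.g.
--    /warehouse/events_merged/datetime_partition=2012-12-20_1200

-- 2. Drop metadata for the three 4-hour partitions. Because the table
--    is external, the underlying files stay in place for running jobs.
ALTER TABLE events DROP PARTITION (datetime_partition='2012-12-20_0400');
ALTER TABLE events DROP PARTITION (datetime_partition='2012-12-20_0800');
ALTER TABLE events DROP PARTITION (datetime_partition='2012-12-20_1200');

-- 3. Register the merged partition, pointing at the new directory.
--    Dropping the old 4-hour '1200' partition first avoids the label
--    conflict described below.
ALTER TABLE events ADD PARTITION (datetime_partition='2012-12-20_1200')
  LOCATION '/warehouse/events_merged/datetime_partition=2012-12-20_1200';

-- 4. Later, garbage-collect the old 4-hour directories once no job
--    that started before the swap could still be reading them.
```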


On Sun, Dec 23, 2012 at 11:42 AM, Timothy Potter <> wrote:
> I'm reaching out for some advice on how to implement a date-based partition
> scheme where, every 30 days or so, we merge many smaller partitions into
> larger ones. For example, our pipeline creates six 4-hour partitions each
> day. After about 30 days, I'd like to combine three consecutive 4-hour
> partitions into fewer 12-hour partitions.
> Here's an example of our partitions:
> .../datetime_partition=2012-12-20_0400
> .../datetime_partition=2012-12-20_0800
> .../datetime_partition=2012-12-20_1200
> .../datetime_partition=2012-12-20_1600
> .../datetime_partition=2012-12-21_0000
> After the merge, I'd like to end up with two larger partitions containing 12
> hours of data each instead of 4:
> .../datetime_partition=2012-12-20_1200
> .../datetime_partition=2012-12-21_0000
> If I merged 04, 08, and 12 into a 12hr block, then the partition label
> should still be "2012-12-20_1200" but that conflicts with the existing 4hr
> partition. Does anything exist in HCatalog / Hive world to help with
> partition archiving like this? I'd like a process that doesn't impact
> running jobs that are still reading data from the partitions being merged.
> Of course, I can see how to do it by writing to another table, but that
> would require some UNION'ing across tables in my Pig scripts. I could also
> create the merged partition in temp space in HDFS, clean out the
> existing partitions, and then write from temp back to HCatalog (which is what
> I'm doing now).
> I guess it boils down to needing an atomic "replace" partition during the
> writing of the larger merged partition.
> Thanks.
> Tim
