incubator-hcatalog-user mailing list archives

From James Estes <james.es...@gmail.com>
Subject Unsubscribe
Date Tue, 25 Dec 2012 20:59:23 GMT
Unsubscribe
On Dec 23, 2012 7:01 PM, "Travis Crawford" <traviscrawford@gmail.com> wrote:

> Hey Tim -
>
> Using an external table comes to mind. You could write the merged
> partition to a new directory, remove metadata for "old" partitions
> that have been merged, then register the new merged partition.
>
> Existing queries would not be affected because the metastore is only
> contacted for inputs when starting the job. Sometime later you could
> garbage collect data files not referenced in the metastore. New
> queries would of course use the new files.
>
> This is similar to deleting files on an inode-based filesystem, with a
> time-based garbage collector substituting for reference counting.
>
> --travis
>
>
> On Sun, Dec 23, 2012 at 11:42 AM, Timothy Potter <thelabdude@gmail.com> wrote:
> > I'm reaching out for some advice on how to implement a date-based partition
> > scheme where every 30 days or so we merge many smaller partitions into
> > larger partitions. For example, our pipeline creates 6 - 4hr partitions each
> > day. After about 30 days, I'd like to combine 3 partitions to make fewer 12
> > hr partitions.
> >
> > Here's an example of our partitions:
> >
> > .../datetime_partition=2012-12-20_0400
> > .../datetime_partition=2012-12-20_0800
> > .../datetime_partition=2012-12-20_1200
> > .../datetime_partition=2012-12-20_1600
> > .../datetime_partition=2012-12-21_0000
> >
> > after the merge, I'd like to end up with two larger partitions containing 12
> > hours of data vs. 4:
> >
> > .../datetime_partition=2012-12-20_1200
> > .../datetime_partition=2012-12-21_0000
> >
> > If I merged 04, 08, and 12 into a 12hr block, then the partition label
> > should still be "2012-12-20_1200" but that conflicts with the existing 4hr
> > partition. Does anything exist in HCatalog / Hive world to help with
> > partition archiving like this? I'd like a process that doesn't impact
> > running jobs that are still reading data from the partitions being merged.
> >
> > Of course, I can see how to do it by writing to another table, but that
> > would require some UNION'ing across tables in my Pig scripts. I could also
> > see how to create the merged partition in temp space in HDFS, clean-out the
> > existing partition and then write from temp back to HCatalog (which is what
> > I'm doing now).
> >
> > I guess it boils down to needing an atomic "replace" partition during the
> > writing of the larger merged partition.
> >
> > Thanks.
> > Tim
>
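
The external-table swap Travis describes in the quoted thread could be sketched roughly as below. This is only an illustration under stated assumptions, not something from the thread: the table name `events`, the `/data/events_merged/...` directory, and the helpers `MergePlan`, `swap_statements` and `collectable` are all hypothetical. The generated DROP/ADD statements are separate DDL calls, so this sketch does not by itself give the atomic "replace" Tim asks about.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class MergePlan:
    table: str                    # hypothetical external table name
    column: str                   # partition column, as named in the thread
    old_values: Tuple[str, ...]   # small partitions being merged
    merged_value: str             # label of the merged partition
    merged_location: str          # new directory holding the merged data


def swap_statements(plan: MergePlan) -> List[str]:
    """Build HiveQL that unregisters the old partitions, then registers the merged one.

    On an external table, dropping a partition removes only the metadata,
    so the old data files stay in place for jobs that are still reading them.
    """
    stmts = [
        f"ALTER TABLE {plan.table} DROP IF EXISTS PARTITION ({plan.column}='{v}')"
        for v in plan.old_values
    ]
    stmts.append(
        f"ALTER TABLE {plan.table} ADD PARTITION ({plan.column}='{plan.merged_value}') "
        f"LOCATION '{plan.merged_location}'"
    )
    return stmts


def collectable(
    unregistered_at: Dict[str, float],
    referenced: Set[str],
    now: float,
    grace_seconds: float,
) -> List[str]:
    """Return directories that are safe to delete under a time-based policy.

    A directory qualifies when the metastore no longer references it and at
    least grace_seconds have passed since its partition was unregistered.
    """
    return sorted(
        path
        for path, dropped in unregistered_at.items()
        if path not in referenced and now - dropped >= grace_seconds
    )


# Example values; the table name and merged directory are assumptions.
plan = MergePlan(
    table="events",
    column="datetime_partition",
    old_values=("2012-12-20_0400", "2012-12-20_0800", "2012-12-20_1200"),
    merged_value="2012-12-20_1200",
    merged_location="/data/events_merged/datetime_partition=2012-12-20_1200",
)

for stmt in swap_statements(plan):
    print(stmt + ";")
```

The grace period passed to `collectable` would need to be longer than the longest job that might still be reading the old files, since nothing here tracks active readers.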
