From Luke Forehand <>
Date Tue, 29 Mar 2011 14:36:46 GMT

Our Hive table import process uses a dynamic partition insert into a temporary table. The
resulting sequence files are then loaded into the master table using LOAD DATA INPATH, because
we want the data online immediately for querying. The loaded data does not overwrite files that
already exist in the partitions, so we are essentially doing an "append" to the partitions.

Our questions: is this a bad practice, and how does it affect table sampling? It seems
that the table sampling mechanism expects as many files in the partition folder as there are
buckets. Doing a "compaction" of the table, using INSERT OVERWRITE to rewrite the partitions,
fixes the table sampling problem, but we would like to avoid the expensive write. Is there a
better way to accomplish our goal of putting data online quickly while preserving the ability
to sample the table?
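
For concreteness, here is a rough sketch of the workflow described above. All table names,
column names, the HDFS path, and the bucket count of 32 are placeholders chosen for
illustration, not our actual schema, and the statements are only meant to show the shape
of each step.

```sql
-- Placeholder names throughout: raw_events, staging_events, events, user_id, payload, dt.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- 1. Dynamic partition insert into the temporary (staging) table.
INSERT OVERWRITE TABLE staging_events PARTITION (dt)
SELECT user_id, payload, dt
FROM raw_events;

-- 2. Move the resulting files into the master table. Without the OVERWRITE
--    keyword, existing files in the partition are kept, so this acts as an append.
--    The path below is hypothetical.
LOAD DATA INPATH '/tmp/staging_events/dt=2011-03-29'
INTO TABLE events PARTITION (dt='2011-03-29');

-- 3. The kind of sampling query that misbehaves after several appends
--    (assumes events is CLUSTERED BY (user_id) INTO 32 BUCKETS).
SELECT *
FROM events TABLESAMPLE(BUCKET 1 OUT OF 32 ON user_id) e
WHERE e.dt = '2011-03-29';

-- 4. The "compaction" we would like to avoid: rewrite the partition in place.
INSERT OVERWRITE TABLE events PARTITION (dt='2011-03-29')
SELECT user_id, payload
FROM events
WHERE dt = '2011-03-29';
```

Step 2 is what gets data online quickly, and step 4 is the expensive write that restores
sampling in step 3.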

Luke Forehand

