hive-user mailing list archives

From Alan Gates <ga...@hortonworks.com>
Subject Re: Large Scale Table Reprocess
Date Fri, 26 Jul 2013 20:29:59 GMT
A table can definitely have partitions with different input formats/serdes.  We test this all
the time.  

Assuming your old data doesn't stay forever and most of your queries are on more recent data
(which is usually the case), I'd advise you not to reprocess any data; just alter the table
to store new partitions in ORC.  Then, over time, the table will gradually transition to ORC.
This avoids all the issues you noted.  And since most queries probably only access recent
data, you'll see speedups soon after the switch.

Alan.

On Jul 25, 2013, at 4:45 PM, John Omernik wrote:

> Just finishing up testing with Hive 11 and ORC. Thank you to Owen and all those who have
put hard work into this. ORC files alone, compared to RC files in Hive 9, 10, and 11, showed
a huge increase in performance; it was amazing.  That said, now we gotta reprocess.
> 
> 
> We have a large table with lots of partitions. I'd love to be able to reprocess into
a new table, like table_orc, and then at the end of it all, just drop the original table.
That said, I see it being hard to do from a space perspective, and I will have to do it a partition
at a time.  But then there are production issues: if I update a partition (insert overwrite into
the ORC table), then I have to delete the original, and production users will be missing data....
decisions decisions.
> 
> So any ideas? Can a table have some partitions in one file type and other partitions
in another? That sounds scary.  Anywho, a good problem to have... that performance will be
worth it. 
> 
> 

