hive-user mailing list archives

From Bejoy Ks <>
Subject Re: Hive Dynamic Partions - How to avoid overwrite
Date Tue, 04 Oct 2011 18:29:35 GMT
Thanks Vaibhav for the great response.
It is definitely a good approach, and my own thinking went in the same direction. However, I'm
doing a daily load into my partitioned Hive table, and I anticipate this being a performance
breaker, with less data in each partition. Basically, my Hive jobs and queries are going to be
centered on a single column, say country in my example. So my ultimate goal is to query data
out of my partitioned table with the least overhead, i.e. preferably without triggering any
MapReduce job. A plain data fetch is what I am after with this implementation, so I wanted to
see whether there is any way forward in that direction. The bottom line is that I want Dynamic
Partition insert queries to behave this way:
	* Create a partition based on the source data if it does not exist, and then write the data there
	* If a partition already exists, don't overwrite it; just add the new data as another file
in the same directory that denotes the partition
So is there a way to achieve this? Isn't this a common requirement in data warehousing, and
why don't we have a workaround in Hive?
It'd be great to get inputs from the Hive experts on this scenario. Is there any JIRA open for
this? If not, I'd like to file one, provided implementing such a requirement is feasible in
Hive.
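
For reference, the two behaviours listed above roughly correspond to a dynamic partition insert written with INSERT INTO instead of INSERT OVERWRITE. The following is only a sketch under assumptions: it presumes a Hive release that supports INSERT INTO (to my knowledge this arrived with Hive 0.8.0, so older installations may not have it), and the table and column names (staging_events, events_by_country, id, payload) are hypothetical.

```sql
-- Hypothetical table and column names, for illustration only.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- INTO (rather than OVERWRITE) appends: a country partition that does not
-- exist yet is created, and for an existing partition the new rows are
-- written as additional files in the same partition directory.
INSERT INTO TABLE events_by_country PARTITION (country)
SELECT id, payload, country   -- the dynamic partition column goes last
FROM staging_events;
```

Note that repeated appends leave many small files in each partition directory, which is worth watching on a daily load.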

Thanks and Regards

From: "Aggarwal, Vaibhav" <>
To: "" <>; Bejoy Ks <>
Sent: Tuesday, October 4, 2011 10:48 PM
Subject: RE: Hive Dynamic Partions - How to avoid overwrite

You can choose to partition by (country, date).
In this case you move the data into a date partition within your country partition and avoid
overwriting old data.
If you choose to go this way, one thing to check is that this does not result in too many
partitions. A large number of partitions leads to long query startup times.
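
A minimal sketch of the (country, date) layout suggested above, under assumptions: the table and column names (events_by_country_dt, staging_events, id, payload) are hypothetical, and the date column is called dt here to stay clear of the DATE keyword.

```sql
-- Hypothetical table and column names, for illustration only.
CREATE TABLE IF NOT EXISTS events_by_country_dt (
  id      BIGINT,
  payload STRING
)
PARTITIONED BY (country STRING, dt STRING);

SET hive.exec.dynamic.partition=true;
-- Both partition columns are dynamic here, which requires nonstrict mode.
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Each daily run produces only (country, dt) partitions for that day's dt,
-- so OVERWRITE replaces nothing written on earlier days.
INSERT OVERWRITE TABLE events_by_country_dt PARTITION (country, dt)
SELECT id, payload, country, '2011-10-04' AS dt
FROM staging_events;
```

With this layout the partition count grows by up to one per country per day, which is the startup-time concern mentioned above.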
From:Bejoy Ks [] 
Sent: Monday, October 03, 2011 7:02 AM
To: hive user group
Subject: Hive Dynamic Partions - How to avoid overwrite
Hi Experts
    I intend to use the Hive dynamic partition approach for my current business use case.
What I have in mind for the design is as follows.
-Load my incoming data into a non-partitioned Hive table (Table 1)
-Load this data into a partitioned Hive table using Dynamic Partitions (Table 2)
-Flush the data in Table 1 (drop the table and recreate it)
With this series of steps my data would be ready for mining.
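
The three steps above could look roughly like the following. This is a sketch only: the table names (staging_events for Table 1, events_by_country for Table 2), the columns, the delimiter and the input path are all hypothetical.

```sql
-- Hypothetical names, columns and paths, for illustration only.
CREATE TABLE IF NOT EXISTS events_by_country (
  id      BIGINT,
  payload STRING
)
PARTITIONED BY (country STRING);

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Step 1: move the day's files into the non-partitioned staging table (Table 1).
LOAD DATA INPATH '/incoming/2011-10-03' INTO TABLE staging_events;

-- Step 2: dynamic partition insert into the partitioned table (Table 2).
-- With OVERWRITE, every country partition that receives rows in this run is
-- replaced; partitions that receive no rows are left untouched.
INSERT OVERWRITE TABLE events_by_country PARTITION (country)
SELECT id, payload, country
FROM staging_events;

-- Step 3: flush Table 1 by dropping and recreating it.
DROP TABLE staging_events;
CREATE TABLE staging_events (id BIGINT, payload STRING, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
```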
    This is going to be a periodic process happening daily. When I searched around, I came
across a concern with this approach: 'the partitions getting overwritten'.
For example, say my second table is partitioned on Country and, in my first load, data is
populated in the partition country=USA. When my Dynamic Partition load/insert is executed a
second time and the source data again contains rows with country=USA, the data already in
that partition would be overwritten with the new data.
Is my understanding of this scenario right? If so, what would be the recommended approach to
overcome this hurdle? Basically, I want the existing data in the partition to be preserved
while new data is added to it. I can't go ahead with the static partition approach because my
data is huge and the number of partitions is also pretty large. Has anyone framed effective
solutions for such scenarios with the Dynamic Partition insert approach? Can someone guide me
to a suitable approach with Hive for such use cases?
Thanks and Regards