hive-user mailing list archives

From Maxime Brugidou <maxime.brugi...@gmail.com>
Subject Re: Hive Dynamic Partions - How to avoid overwrite
Date Tue, 04 Oct 2011 18:20:15 GMT
I suspect you can't do that unless you use Hive 0.8.
From the wiki:

"INSERT INTO will append to the table or partition keeping the existing data
intact. (Note: INSERT INTO syntax is only available starting in version
0.8)"
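
As a rough sketch of what that could look like on 0.8, assuming a hypothetical
staging table named staging_table and a target table named target_table that is
partitioned by country (the columns x, y, z are borrowed from Florin's example,
and none of these names come from your actual schema):

```sql
-- Allow dynamic partition inserts with no static partition column.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- INSERT INTO (Hive 0.8 and later) appends to each country partition
-- instead of replacing its contents.
-- The dynamic partition column (country) must come last in the SELECT list.
INSERT INTO TABLE target_table PARTITION (country)
SELECT x, y, z, country
FROM staging_table;
```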

If you don't have 0.8, then I suggest that you partition by day in
addition to Country, so that each day's load writes to new partitions and
does not overwrite previous days. Use INSERT OVERWRITE as usual.
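
A minimal sketch of the day-plus-country layout, assuming hypothetical table
names (staging_table, target_table), placeholder columns x, y, z, and a load
date supplied by whatever schedules the daily job:

```sql
-- Hypothetical target table, partitioned by load day and then by country.
CREATE TABLE IF NOT EXISTS target_table (x STRING, y STRING, z STRING)
PARTITIONED BY (day STRING, country STRING);

SET hive.exec.dynamic.partition=true;

-- day is a static partition value; country is resolved dynamically per row.
-- Only partitions under day='2011-10-04' are overwritten, so data loaded
-- on earlier days is left untouched.
INSERT OVERWRITE TABLE target_table PARTITION (day='2011-10-04', country)
SELECT x, y, z, country
FROM staging_table;
```

Because one partition column is static, this should also work in the strict
dynamic partition mode. Queries that need all data for a country would then
filter on country and scan across the day partitions.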

Cheers,
Maxime

On Tue, Oct 4, 2011 at 8:14 PM, Bejoy Ks <bejoy_ks@yahoo.com> wrote:

> Thanks Florin for your response.
> But I have a concern with the suggested approach. Over time, my partitioned
> table would hold hundreds of terabytes of data, so running a UNION over the
> whole table every time I load data from the staging table into the production
> partitioned table would be far too expensive.
> Is there any other workaround that you feel would be suitable in my case?
>
> Thanks and Regards
> Bejoy.K.S
>
> ------------------------------
> *From:* Florin Diaconeasa <florin.diaconeasa@gmail.com>
> *To:* user@hive.apache.org; Bejoy Ks <bejoy_ks@yahoo.com>
> *Sent:* Tuesday, October 4, 2011 2:46 AM
> *Subject:* Re: Hive Dynamic Partions - How to avoid overwrite
>
> I would recommend running the following statement:
>
> INSERT OVERWRITE TABLE <target_table>
> SELECT * FROM
> (
>   SELECT x, y, z
>   FROM <input_table>
>   UNION ALL
>   SELECT *
>   FROM <target_table>
> ) allTables;
>
> Obviously, UNION ALL comes with rules: for example, you need to name (and,
> if necessary, alias) all the columns of each SELECT. There is more on this on
> the Hive wiki.
>
> Florin
>
> On Oct 3, 2011, at 5:02 PM, Bejoy Ks wrote:
>
> Hi Experts
>     I intend to use the Hive dynamic partition approach for my current
> business use case. The design I have in mind is as follows:
> - Load my incoming data into a non-partitioned Hive table (Table 1)
> - Load this data into a partitioned Hive table using dynamic partitions (Table
> 2)
> - Flush the data in Table 1 (drop the table and recreate it)
> With this series of steps my data would be ready for mining.
>     This is going to be a periodic process that runs daily. When I searched
> around I came across a concern with this approach: 'the partitions getting
> overwritten'.
> For example, say my second table is partitioned by Country, and my first
> load populates the partition country=USA. When the dynamic partition
> load/insert is executed a second time and the source data again contains
> rows with country=USA, the data that is already in the partition would be
> overwritten with the new data.
> Is my understanding of this scenario right? If so, what would be the
> recommended approach to overcome this hurdle? Basically, I want the
> existing data in the partition to be preserved while new data is added
> to it. I can't go ahead with the static partition approach because my data is
> huge and the number of partitions is also pretty large. Has someone framed
> effective solutions for such scenarios with the dynamic partition insert
> approach? Can someone guide me to a suitable approach with Hive for such
> use cases?
>
> Thanks and Regards
> Bejoy.K.S
>
>
>
>
>
