hive-user mailing list archives

From Maxim Veksler <>
Subject Re: How to apply RDBMS table updates and deletes into Hadoop
Date Wed, 09 Jun 2010 18:40:51 GMT

On Wed, Jun 9, 2010 at 9:26 PM, Edward Capriolo <> wrote:
> On Tue, Jun 8, 2010 at 2:54 PM, atreju <> wrote:
>> To generate smart output from base data, we need to copy some base tables
>> from a relational database into Hadoop. Some of them are big. Dumping the
>> entire tables into Hadoop every day is not an option, since there are 30+
>> tables and each would take several hours.
>> The approach we took is to get the entire table dump first.
>> Then, each day or every 4-6 hours, get only the inserts/updates/deletes
>> since the last copy from the RDBMS (based on a date field in the table).
>> Using Hive, do an outer join + union of the new data with the existing
>> data and write the result into a new file. For example, if there are 100
>> rows in Hadoop, and in the RDBMS 3 records were inserted, 2 records
>> updated and 1 deleted since the last Hadoop copy, then the Hive query
>> will get the 97 unchanged rows + 3 inserts + 2 updates and write them
>> into a new file. Other applications like Pig or Hive will pick the most
>> recent file to use when selecting/loading data from those base table
>> data files.
This solution is very interesting.

Could you please describe further the logic for filtering out the deleted
records, and how you handle UPDATEs for existing records in Hive (hadoop

Thank you,
