hadoop-general mailing list archives

From Aaron Kimball <aa...@cloudera.com>
Subject Re: How to apply RDBMS table updates and deletes into Hadoop
Date Tue, 08 Jun 2010 22:58:50 GMT
I think that this might be the way to go. In general, folding updates and
deletes into datasets is a difficult problem due to the append-only nature
of datasets.

Something that might help you here is to partition your tables in Hive based
on some well-distributed key. If a relatively small number of partitions is
affected by an incremental import (perhaps more recently-imported records are
more likely to be updated? In that case, partition the tables by the
month/week you imported them), then you only need to perform the fold-in of
the new deltas on the affected partitions. This should be much faster than a
full table scan.
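To make the idea concrete, here is a minimal sketch of partition-pruned fold-in, with rows modeled as dicts keyed by a primary key and partitioned by import month. All names here ("id", "month", fold_in) are illustrative, not part of Hive or Sqoop:

```python
# Sketch: fold a delta into only the partitions it touches.
# partitions: {month: {id: row}}  -- the current snapshot, per partition
# delta:      list of inserted or updated rows since the last import
# deletes:    set of (month, id) pairs removed in the RDBMS
def fold_in(partitions, delta, deletes):
    touched = {row["month"] for row in delta} | {m for m, _ in deletes}
    for month in touched:  # untouched partitions are never rewritten
        part = partitions.setdefault(month, {})
        for row in delta:
            if row["month"] == month:
                part[row["id"]] = row      # insert, or overwrite on update
        for m, rid in deletes:
            if m == month:
                part.pop(rid, None)        # apply the delete
    return partitions
```

The point of the sketch is the `touched` set: with month-based partitioning, the rewrite cost scales with the size of the affected partitions rather than the whole table.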

Have you seen the Sqoop tool? It handles imports and exports between HDFS
(and Hive) and RDBMS systems, but currently it can only import new records
(and subsequent INSERTs); it can't handle updates/deletes. Sqoop is
available at http://github.com/cloudera/sqoop -- it doesn't run on Apache
0.20.3, but works on CDH (Cloudera's Distribution for Hadoop).

This sort of capability is something I'm really interested in adding to
Sqoop. If you've got a well-run process for doing this, I'd really
appreciate your help adding this feature :) Send me an email off-list if
you're interested. At the very least, I'd urge you to try out the tool.

- Aaron Kimball

On Tue, Jun 8, 2010 at 8:54 PM, atreju <n.atreju@gmail.com> wrote:

> To generate smart output from base data we need to copy some base tables
> from relational database into Hadoop. Some of them are big. To dump the
> entire table into Hadoop everyday is not an option since there are like 30+
> tables and each would take several hours.
> The methodology that we approached is to get the entire table dump first.
> Then each day or every 4-6 hours get only insert/update/delete since the
> last copy from RDBMS (based on a date field in the table). Using Hive do
> outer join + union the new data with existing data and write into a new
> file. For example, if there are 100 rows in Hadoop, and in the RDBMS 3 records
> inserted, 2 records updated and 1 deleted since the last Hadoop copy, then
> the Hive query will get 97 of the not changed data + 3 inserts + 2 updates
> and write into a new file. The other applications like Pig or Hive will pick
> the most recent file to use when selecting/loading data from those base
> table data files.
> This logic is working fine in lower environments for small size tables.
> With production data, for about 30GB size table, the incremental
> re-generation of the file in Hadoop is still taking several hours. I tried
> using zipped version and it took even longer time. I am not convinced that
> this is the best we can do to handle updates and deletes since we had to
> re-write 29GB unchanged data of the 30GB file again into a new file. ...and
> this is not the biggest table.
> I am thinking that this must be a problem for many companies. What are the
> other approaches for applying updates and deletes on base tables to the
> Hadoop data files?
> We have 4 data nodes and are using version 0.20.3.
> Thanks!
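The outer join + union fold-in described in the message above (97 unchanged rows + 3 inserts + 2 updates, with the delete dropped) can be sketched outside Hive. This is a hypothetical Python rendering of the merge logic, not the actual production query:

```python
# base:        {id: row} -- the current snapshot stored in Hadoop
# delta:       {id: row} -- rows inserted or updated in the RDBMS since
#                           the last copy (selected by the date field)
# deleted_ids: ids removed in the RDBMS since the last copy
def merge_snapshot(base, delta, deleted_ids):
    # Keep only rows untouched by the delta and not deleted ("the 97"),
    merged = {rid: row for rid, row in base.items()
              if rid not in delta and rid not in deleted_ids}
    # then union in the inserts and updates (the delta side wins the join).
    merged.update(delta)
    return merged  # written out as a new file, replacing the old one
```

As the message points out, the expensive part in practice is not this logic but rewriting the unchanged rows; the merge itself is a single pass over both inputs.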
