hadoop-general mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: How to apply RDBMS table updates and deletes into Hadoop
Date Wed, 09 Jun 2010 23:56:51 GMT
When Hive is running the map-reduce job, how do we handle concurrent
updates/deletions/insertions?

On Wed, Jun 9, 2010 at 3:32 PM, atreju <n.atreju@gmail.com> wrote:

> As an ideal solution, I have a suggestion for the Hive contributors to make it
> look like Hive is doing insert/update/delete:
>
>
> This will require a special type of table creation syntax. I will call it
> an "UPDATABLE TABLE". The table will have 3 special columns that are defined
> in the create script:
> 1. The primary key column. (let's say: col_pk)
> 2. A BIGINT date column that stores the milliseconds from Jan 1st, 1970 to the
> actual data manipulation date/time in the RDBMS. (dml_date)
> 3. A TINYINT or BOOLEAN column that stores 0 if the record is deleted
> and 1 if it is inserted or updated. (dml_action)
>
> This will require the RDBMS table to have a PK and a last-update-date column,
> and deletes to be recorded in some other table by PK and date.
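>
> Since the proposed UPDATABLE TABLE syntax does not exist today, here is a
> minimal sketch with a plain CREATE TABLE and the three special columns
> spelled out (the table name and the ordinary data columns are made up):
>
>   CREATE TABLE customer_updatable (
>     col_pk     BIGINT,    -- primary key carried over from the RDBMS table
>     name       STRING,    -- ordinary data columns, just for illustration
>     email      STRING,
>     dml_date   BIGINT,    -- ms since Jan 1st, 1970 of the RDBMS insert/update/delete
>     dml_action TINYINT    -- 0 = deleted, 1 = inserted or updated
>   )
>   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
>   STORED AS TEXTFILE;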
>
> On Day 1, the entire table is put into Hadoop, with the addition of 2 extra
> columns: dml_date (bigint) and dml_action.
>
> On Day 2, we first find the max of dml_date from the Hive table. Then we query
> the RDBMS for the inserts/updates/deletes since that date/time and write them
> into a file with the correct dml_date/dml_action. The file goes into the same
> folder that our Hive table is in.
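>
> (The "max of dml_date" step is just, against the hypothetical table above:
>
>   SELECT MAX(dml_date) FROM customer_updatable;
>
> and that value becomes the lower bound for the extract query on the RDBMS side.)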
>
> Right now, if on Day 1 we had 100 rows and on Day 2 we add 10 rows, a regular
> Hive table would show 110 rows. But since this is a special table (UPDATABLE
> TABLE), every time this table is queried in Hive, Hive first runs a map-reduce
> that finds the most recent (max(dml_date)) row per pk (group by col_pk) that is
> not deleted (dml_action != 0) and uses that output in the user's query. That is
> the big idea!!
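>
> A sketch of the collapsing query Hive would effectively have to run under the
> covers, written here as plain HiveQL against the hypothetical table above (Hive
> does not do any of this automatically today):
>
>   -- keep only the latest version of each pk, and drop deleted rows
>   SELECT t.*
>   FROM customer_updatable t
>   JOIN (
>     SELECT col_pk, MAX(dml_date) AS max_dml_date
>     FROM customer_updatable
>     GROUP BY col_pk
>   ) latest
>     ON t.col_pk = latest.col_pk AND t.dml_date = latest.max_dml_date
>   WHERE t.dml_action != 0;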
>
> Hive can have Insert/Update/Delete commands that would do nothing but create
> a file with the manipulated rows, carrying the correct date and action.
>
> There can be a special "flush" kind of command that runs the MR and replaces
> all files in the table directory with a single file. That could run weekly,
> monthly, or maybe each time DML data is received from the RDBMS.
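>
> Assuming a hypothetical staging table with the same schema, the "flush" could
> boil down to materializing that same collapsing query and then swapping the
> result into the table directory:
>
>   INSERT OVERWRITE TABLE customer_updatable_compacted
>   SELECT t.*
>   FROM customer_updatable t
>   JOIN (
>     SELECT col_pk, MAX(dml_date) AS max_dml_date
>     FROM customer_updatable
>     GROUP BY col_pk
>   ) latest
>     ON t.col_pk = latest.col_pk AND t.dml_date = latest.max_dml_date
>   WHERE t.dml_action != 0;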
>
> Sqoop could have a Hive interface that saves certain table attributes like the
> pk column, RDBMS connection info,... and with one command from Hive, the Hive
> table gets updated from the RDBMS....
>
> What do you think?
>
>
>
> On Tue, Jun 8, 2010 at 3:58 PM, Aaron Kimball <aaron@cloudera.com> wrote:
>
> > I think that this might be the way to go. In general, folding updates and
> > deletes into existing datasets is a difficult problem due to their
> > append-only nature.
> >
> > Something that might help you here is to partition your tables in Hive
> > based on some well-distributed key. Then if you have a relatively small
> > number of partitions affected by an incremental import (perhaps more
> > recently-imported records are more likely to be updated? in that case,
> > partition the tables by the month/week you imported them?) you only need to
> > perform the fold-in of the new deltas on the affected partitions. This
> > should be much faster than a full table scan.
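> >
> > A rough sketch of that idea (the table, columns, and partition values below
> > are made up for illustration):
> >
> >   -- base table partitioned by the month the rows were imported
> >   CREATE TABLE base_events (
> >     col_pk   BIGINT,
> >     payload  STRING,
> >     dml_date BIGINT
> >   )
> >   PARTITIONED BY (import_month STRING);
> >
> >   -- an incremental import then only has to rewrite the partitions its
> >   -- deltas touch, e.g. the current month, instead of the whole table
> >   INSERT OVERWRITE TABLE base_events PARTITION (import_month = '2010-06')
> >   SELECT col_pk, payload, dml_date
> >   FROM merged_2010_06;  -- hypothetical staging table: old 2010-06 rows folded with the new deltas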
> >
> > Have you seen the Sqoop tool? It handles imports and exports between HDFS
> > (and Hive) and RDBMS systems -- but currently can only import new records
> > (and subsequent INSERTs); it can't handle updates/deletes. Sqoop is
> > available at http://github.com/cloudera/sqoop -- it doesn't run on Apache
> > 0.20.3, but works on CDH (Cloudera's Distribution for Hadoop) and Hadoop
> > 0.21/trunk.
> >
> > This sort of capability is something I'm really interested in adding to
> > Sqoop. If you've got a well-run process for doing this, I'd really
> > appreciate your help adding this feature :) Send me an email off-list if
> > you're interested. At the very least, I'd urge you to try out the tool.
> >
> > Cheers,
> > - Aaron Kimball
> >
> >
> > On Tue, Jun 8, 2010 at 8:54 PM, atreju <n.atreju@gmail.com> wrote:
> >
> >> To generate smart output from base data we need to copy some base tables
> >> from the relational database into Hadoop. Some of them are big. Dumping the
> >> entire table into Hadoop every day is not an option, since there are 30+
> >> tables and each would take several hours.
> >>
> >> The methodology that we took is to get the entire table dump first.
> >> Then each day, or every 4-6 hours, get only the inserts/updates/deletes since
> >> the last copy from the RDBMS (based on a date field in the table). Using Hive,
> >> outer join + union the new data with the existing data and write it into a new
> >> file. For example, if there are 100 rows in Hadoop, and in the RDBMS 3 records
> >> were inserted, 2 records updated and 1 deleted since the last Hadoop copy,
> >> then the Hive query will get the 97 unchanged rows + 3 inserts + 2 updates and
> >> write them into a new file. The other applications like Pig or Hive will pick
> >> the most recent file to use when selecting/loading data from those base
> >> table data files.
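> >>
> >> For reference, a minimal sketch of that "outer join + union" merge (the table
> >> names, columns, output path, and the 'D' delete flag below are made up;
> >> delta_t stands for the incremental extract from the RDBMS):
> >>
> >>   INSERT OVERWRITE DIRECTORY '/warehouse/base_t_v2'
> >>   SELECT u.id, u.payload, u.last_update
> >>   FROM (
> >>     -- unchanged rows: current rows whose key does not appear in the delta
> >>     SELECT b.id, b.payload, b.last_update
> >>     FROM base_t b
> >>     LEFT OUTER JOIN delta_t d ON b.id = d.id
> >>     WHERE d.id IS NULL
> >>     UNION ALL
> >>     -- inserted and updated rows from the extract, minus the deletes
> >>     SELECT d.id, d.payload, d.last_update
> >>     FROM delta_t d
> >>     WHERE d.action != 'D'
> >>   ) u;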
> >>
> >> This logic is working fine in lower environments for small tables. With
> >> production data, for a table of about 30GB, the incremental re-generation of
> >> the file in Hadoop is still taking several hours. I tried using a compressed
> >> version and it took even longer. I am not convinced that this is the best we
> >> can do to handle updates and deletes, since we had to re-write the 29GB of
> >> unchanged data in the 30GB file into a new file again. ...and this is not
> >> the biggest table.
> >>
> >> I am thinking that this must be a problem for many companies. What are the
> >> other approaches to apply updates and deletes on base tables to the
> >> Hadoop data files?
> >>
> >> We have 4 data nodes and are using version 0.20.3.
> >>
> >> Thanks!
> >>
> >>
> >
> >
>
