hadoop-common-user mailing list archives

From "James Moore" <jamesthepi...@gmail.com>
Subject Re: Using MapReduce to do table comparing.
Date Wed, 23 Jul 2008 17:12:33 GMT
On Wed, Jul 23, 2008 at 7:33 AM, Amber <guxiaobo1982@hotmail.com> wrote:
> We have a 10 million row table exported from an AS400 mainframe every day. The
> table is exported as a csv text file, which is about 30GB in size; the csv file
> is then imported into an RDBMS table which is dropped and recreated every day.
> Now we want to find how many rows are updated during each export-import
> interval. The table has a primary key, so deletes and inserts can be found
> quickly using RDBMS joins, but we must do a column-by-column comparison to find
> the differences between rows (about 90%) that share the same primary keys. Our
> goal is a comparison process that takes no more than 10 minutes on a 4-node
> cluster, where each server has 4 4-core 3.0 GHz CPUs, 8GB of memory and a 300GB
> local RAID5 array.
>
> Below is our current solution:
>    The old data is kept in the RDBMS with an index on the primary key, and the
> new data is imported into HDFS as the input to our MapReduce job. Every map
> task connects to the RDBMS and looks up the old row for each new row it reads;
> map tasks emit output only where differences are found, and there are no reduce
> tasks.
>
> As you can see, as the number of concurrent map tasks increases, the RDBMS
> becomes the bottleneck, so we want to remove the RDBMS entirely, but we have no
> idea how to quickly retrieve an old row by key from HDFS files. Any suggestion
> is welcome.

Think of map/reduce as giving you a kind of key/value lookup for free
- it just falls out of how the system works.

You don't care about the RDBMS.  It's a distraction - you're given a
set of csv files with unique keys and dates, and you need to find the
differences between them.

Say the data looks like this:

File for jul 10:
0x1,stuff
0x2,more stuff

File for jul 11:
0x1,stuff
0x2,apples
0x3,parrot

Preprocess the csv files to add dates to the values:

File for jul 10:
0x1,20080710,stuff
0x2,20080710,more stuff

File for jul 11:
0x1,20080711,stuff
0x2,20080711,apples
0x3,20080711,parrot

Feed two days' worth of these files into a hadoop job.

The mapper splits these into k=0x1, v=20080710,stuff etc.
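
In Java against the old org.apache.hadoop.mapred API, that mapper is
only a few lines.  A minimal sketch (the class name is invented),
assuming the date has already been prepended to the value as above:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Turns "0x1,20080710,stuff" into k=0x1, v=20080710,stuff
public class DiffMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    String row = line.toString();
    int comma = row.indexOf(',');       // first comma ends the key
    out.collect(new Text(row.substring(0, comma)),
                new Text(row.substring(comma + 1)));
  }
}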

The reducer gets one or two v's per key, and each v has the date
embedded in it - that's essentially your lookup step.
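
A sketch of the matching reducer (again an invented name; in practice
you'd pass the old date in through the JobConf rather than hard-coding
it):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class DiffReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {
  // Hard-coded for the sketch; read it from the JobConf in real life.
  private static final String OLD_DATE = "20080710";

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    String oldRow = null, newRow = null;
    while (values.hasNext()) {
      String v = values.next().toString();   // "yyyymmdd,rest of row"
      int comma = v.indexOf(',');
      if (v.substring(0, comma).equals(OLD_DATE)) {
        oldRow = v.substring(comma + 1);
      } else {
        newRow = v.substring(comma + 1);
      }
    }
    if (oldRow == null)              out.collect(key, new Text("INSERTED"));
    else if (newRow == null)         out.collect(key, new Text("DELETED"));
    else if (!oldRow.equals(newRow)) out.collect(key, new Text("UPDATED"));
    // rows identical on both days produce no output
  }
}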

You'll end up with a system that can do compares for any two dates,
and could easily be expanded to do all sorts of deltas across these
files.
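
Wiring it up for an arbitrary pair of dates is the usual driver
boilerplate - roughly this, with the /exports/yyyymmdd layout invented
for the example:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TableDiff {
  // usage: hadoop jar diff.jar TableDiff 20080710 20080711
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(TableDiff.class);
    conf.setJobName("table-diff-" + args[0] + "-" + args[1]);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setMapperClass(DiffMapper.class);
    conf.setReducerClass(DiffReducer.class);
    // the reducer sketch hard-codes the old date; really you'd pass it here
    conf.set("diff.old.date", args[0]);
    // both days' files go in as input
    FileInputFormat.addInputPath(conf, new Path("/exports/" + args[0]));
    FileInputFormat.addInputPath(conf, new Path("/exports/" + args[1]));
    FileOutputFormat.setOutputPath(conf,
        new Path("/exports/diff-" + args[0] + "-" + args[1]));
    JobClient.runJob(conf);
  }
}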

The preprocess-the-files-to-add-a-date step can probably be folded into
your mapper rather than run separately - it just depends on how easy it
is to use one of the off-the-shelf mappers with your data.  If it does
turn out to be its own step, it can be a very simple hadoop job.  A
sketch of the folded-in version is below.
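
For instance, if each day's export lands in its own dated directory,
the mapper can pick the date up from the path of the file it's reading
("map.input.file" in the old API) instead of expecting a preprocessed
column.  The directory layout is an assumption for the example:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Same as DiffMapper, but tags each value with a date taken from the
// input file's directory name instead of a preprocessed column.
public class DatedDiffMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  private String date;

  public void configure(JobConf job) {
    // "map.input.file" is the path of the file this task is reading,
    // e.g. .../exports/20080711/table.csv -> "20080711"
    date = new Path(job.get("map.input.file")).getParent().getName();
  }

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    String row = line.toString();
    int comma = row.indexOf(',');
    out.collect(new Text(row.substring(0, comma)),
                new Text(date + "," + row.substring(comma + 1)));
  }
}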

-- 
James Moore | james@restphone.com
Ruby and Ruby on Rails consulting
blog.restphone.com
