hadoop-common-user mailing list archives

From "Amber" <guxiaobo1...@hotmail.com>
Subject Re: Using MapReduce to do table comparing.
Date Thu, 24 Jul 2008 15:03:20 GMT
Yes, I think this is the simplest method, but there are problems too:

1. The reduce stage won't begin until the map stage ends, by which time we have already
scanned both tables, and the comparing will take almost as long again, because about 90% of
the intermediate keys will have two values to compare. It would be better if I could specify
a number n so that the reduce tasks for a key start once n intermediate pairs share that key;
in my case I would set that magic number to 2.

2. I am not sure how Hadoop stores intermediate <key, value> pairs; if they are kept in
memory we could not afford it as the data volume increases.
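
For concreteness, a rough sketch of the compare job suggested below might look like this.
It is only a sketch: the class names and output labels are made up, it uses the old
org.apache.hadoop.mapred API, and it assumes the export date has already been prepended to
each row as described in the quoted message.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Input lines are assumed to already carry the export date:
//   <primary key>,<yyyymmdd>,<rest of the row>
public class RowDiff {

  public static class DiffMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String s = line.toString();
      int comma = s.indexOf(',');
      // key = primary key, value = "<date>,<rest of row>"
      out.collect(new Text(s.substring(0, comma)),
                  new Text(s.substring(comma + 1)));
    }
  }

  public static class DiffReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String first = values.next().toString();
      if (!values.hasNext()) {
        // Key appears in only one export: an insert or a delete,
        // depending on which date the value carries.
        out.collect(key, new Text("ONLY," + first));
        return;
      }
      String second = values.next().toString();
      // Strip the leading "<date>," before comparing the row bodies.
      String row1 = first.substring(first.indexOf(',') + 1);
      String row2 = second.substring(second.indexOf(',') + 1);
      if (!row1.equals(row2)) {
        out.collect(key, new Text("UPDATED," + first + "|" + second));
      }
    }
  }
}

Each reduce call sees either one value (the key exists in only one of the two exports) or
two values (the row exists in both exports and the bodies can be compared), which is where
the 90% figure above comes in.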

--------------------------------------------------
From: "James Moore" <jamesthepiper@gmail.com>
Sent: Thursday, July 24, 2008 1:12 AM
To: <core-user@hadoop.apache.org>
Subject: Re: Using MapReduce to do table comparing.

> On Wed, Jul 23, 2008 at 7:33 AM, Amber <guxiaobo1982@hotmail.com> wrote:
>> We have a 10 million row table exported from an AS/400 mainframe every day.
>> The table is exported as a CSV text file, which is about 30 GB in size, and
>> the CSV file is then imported into an RDBMS table that is dropped and
>> recreated every day. Now we want to find how many rows are updated during
>> each export-import interval. The table has a primary key, so deletes and
>> inserts can be found quickly using RDBMS joins, but we must do a
>> column-by-column comparison to find the differences between rows (about 90%)
>> that have the same primary keys. Our goal is to find a comparing process
>> that takes no more than 10 minutes on a 4-node cluster, where each server
>> has 4 4-core 3.0 GHz CPUs, 8 GB of memory and a 300 GB local RAID5 array.
>>
>> Below is our current solution:
>>    The old data is kept in the RDBMS with an index created on the primary
>> key, and the new data is imported into HDFS as the input file of our
>> MapReduce job. Every map task connects to the RDBMS database and selects
>> the old data from it for every row; map tasks generate output if
>> differences are found, and there are no reduce tasks.
>>
>> As you can see, as the number of concurrent map tasks increases, the RDBMS
>> database will become the bottleneck, so we want to take the RDBMS out of
>> the picture, but we have no idea how to retrieve the old row for a given
>> key quickly from HDFS files. Any suggestion is welcome.
> 
> Think of map/reduce as giving you a kind of key/value lookup for free
> - it just falls out of how the system works.
> 
> You don't care about the RDBMS.  It's a distraction - you're given a
> set of csv files with unique keys and dates, and you need to find the
> differences between them.
> 
> Say the data looks like this:
> 
> File for jul 10:
> 0x1,stuff
> 0x2,more stuff
> 
> File for jul 11:
> 0x1,stuff
> 0x2,apples
> 0x3,parrot
> 
> Preprocess the csv files to add dates to the values:
> 
> File for jul 10:
> 0x1,20080710,stuff
> 0x2,20080710,more stuff
> 
> File for jul 11:
> 0x1,20080711,stuff
> 0x2,20080711,apples
> 0x3,20080711,parrot
> 
> Feed two days' worth of these files into a hadoop job.
> 
> The mapper splits these into k=0x1, v=20080710,stuff etc.
> 
> The reducer gets one or two v's per key, and each v has the date
> embedded in it - that's essentially your lookup step.
> 
> You'll end up with a system that can do compares for any two dates,
> and could easily be expanded to do all sorts of deltas across these
> files.
> 
> The preprocess-the-files-to-add-a-date can probably be included as
> part of your mapper and isn't really a separate step - just depends on
> how easy it is to use one of the off-the-shelf mappers with your data.
> If it turns out to be its own step, it can become a very simple
> hadoop job.
> 
> -- 
> James Moore | james@restphone.com
> Ruby and Ruby on Rails consulting
> blog.restphone.com
> 
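
For completeness, the two classes sketched near the top of this message could be wired
together with a small driver along these lines (again only a sketch; the input and output
paths are placeholders, one directory per daily export already in HDFS):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class RowDiffDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(RowDiff.class);
    conf.setJobName("row-diff");
    // Final output is <primary key, description of the change>.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setMapperClass(RowDiff.DiffMapper.class);
    conf.setReducerClass(RowDiff.DiffReducer.class);
    // Placeholder paths for the two daily exports and the diff output.
    FileInputFormat.setInputPaths(conf,
        new Path("/exports/20080710"), new Path("/exports/20080711"));
    FileOutputFormat.setOutputPath(conf, new Path("/exports/diff-20080711"));
    JobClient.runJob(conf);
  }
}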