hadoop-common-user mailing list archives

From "Black, Michael (IS)" <Michael.Bla...@ngc.com>
Subject Re: Import data from mysql
Date Mon, 10 Jan 2011 20:46:44 GMT
You need to stop looking at this as an all-or-nothing batch job and look at it more like real-time.
You only need to do an absolute max of 1*10,000 computations at a time. And you actually only need to
do considerably fewer than that once age preference and other user factors are applied. Doing
the computation via a built-in function in your database also means you don't have to retrieve all
the data, split it, start JVMs, reduce it, spit out the file, read the file back in, and so on, which
saves a lot more time than using hadoop.
You should be able to do 10,000 computations inside MySQL in less than a second.  It will
take you a minute or more to do it in hadoop.
Just do them as they occur and don't worry about the once-per-day thing.  Then you're left
with a linear growth pattern, which can be handled by running MySQL in a cluster, rather than
the N^2 growth you get from recomputing everything in hadoop.
Give it a try and see how it performs for you...
The whole thing will boil down to one SQL statement where you add potential matches via the
compare function.
Something like this:
idcur = current id of add/change
select id, score(idcur, id) from people where religion='RELIGIONX' and SEX='M' and AGE BETWEEN X and Y
You then update the match table for the returned users whose score clears some threshold (I would
assume there's a threshold).
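
As a rough sketch of the query-then-threshold idea above, here is a small runnable version. It
makes several assumptions that are not from this thread: it uses Python's built-in sqlite3 in
place of MySQL so it can run anywhere, the people/matches table layouts and the "interests"
column are hypothetical, and score() is a made-up rule (count of shared interests) standing in
for whatever the real compare function is. It also passes column values to score() through a
self-join rather than passing the two ids.

```python
import sqlite3


def score(a, b):
    # Hypothetical scoring rule: number of interests the two people share.
    return len(set(a.split(",")) & set(b.split(",")))


def find_matches(conn, idcur, religion, sex, age_min, age_max, threshold):
    # One SQL statement scores only the candidates that pass the filters.
    rows = conn.execute(
        """
        SELECT p.id, score(cur.interests, p.interests) AS s
        FROM people AS p
        JOIN people AS cur ON cur.id = ?
        WHERE p.religion = ? AND p.sex = ? AND p.age BETWEEN ? AND ?
          AND p.id <> cur.id
        """,
        (idcur, religion, sex, age_min, age_max),
    ).fetchall()
    # Keep only scores at or above the threshold and record them.
    keep = [(idcur, pid, s) for pid, s in rows if s >= threshold]
    conn.executemany(
        "INSERT OR REPLACE INTO matches (id_a, id_b, score) VALUES (?, ?, ?)",
        keep,
    )
    return sorted(keep)


def demo():
    conn = sqlite3.connect(":memory:")
    # Register the Python function so SQL can call it as score(x, y).
    conn.create_function("score", 2, score)
    conn.execute(
        "CREATE TABLE people (id INTEGER PRIMARY KEY, religion TEXT, "
        "sex TEXT, age INTEGER, interests TEXT)"
    )
    conn.execute(
        "CREATE TABLE matches (id_a INTEGER, id_b INTEGER, score INTEGER, "
        "PRIMARY KEY (id_a, id_b))"
    )
    conn.executemany(
        "INSERT INTO people VALUES (?, ?, ?, ?, ?)",
        [
            (1, "X", "F", 30, "hiking,jazz,chess"),
            (2, "X", "M", 32, "jazz,chess,cooking"),
            (3, "X", "M", 29, "golf"),
            (4, "Y", "M", 31, "hiking,jazz,chess"),
            (5, "X", "M", 55, "hiking,jazz,chess"),
        ],
    )
    # User 1 was just added or changed: score her against filtered candidates.
    return find_matches(conn, 1, "X", "M", 25, 40, threshold=2)


if __name__ == "__main__":
    print(demo())
```

In MySQL itself the score function would be a stored function or UDF rather than a Python
callback, but the shape of the statement is the same: filter first, score what is left, and
write back only the rows above the threshold.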
I don't know if you care to elucidate your "score" as I don't see a whole lot of numeric flexibility
in matching people...unless you're doing personality profiles too.
Michael D. Black
Senior Scientist
Advanced Analytics Directorate
Northrop Grumman Information Systems


From: Brian [mailto:brian.mcsweeney@gmail.com]
Sent: Mon 1/10/2011 2:00 PM
To: common-user@hadoop.apache.org
Cc: <common-user@hadoop.apache.org>
Subject: EXTERNAL:Re: Import data from mysql

Hi Michael,

that all makes total sense and I very much appreciate your help. 
Leaving the Bayesian issue aside for a moment, I still think I'm 
stuck with a potentially big calculating problem, even if it is not 

For example, imagine I've got 10,000 users of each gender. If only 100 
update their preferences and another 100 join, I'm still talking about 
2 million calculations for the new/updated users to score everyone 
else and another 2 million for existing users to create scores for the 
new users.
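
The arithmetic in the example above can be checked with a short sketch. The function name and
parameters are hypothetical; the only inputs are the figures quoted in the example (100
updated users, 100 new users, 10,000 candidates each).

```python
def batch_cost(changed_users, candidates_per_user, both_directions=True):
    # Each changed user is scored against every candidate once.
    one_way = changed_users * candidates_per_user
    # Optionally count the reverse scores (existing users scoring the changed ones).
    return one_way * 2 if both_directions else one_way


changed = 100 + 100  # updated users plus new users
print(batch_cost(changed, 10_000, both_directions=False))  # 2000000
print(batch_cost(changed, 10_000))                         # 4000000
```

Handled one add/change at a time, each event costs at most 10,000 score computations in each
direction, which is the 1*10,000 figure from the reply.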

Thus, I would greatly appreciate your opinion on whether or not using 
hadoop for this would make sense in order to parallelize the task if 
it gets too slow.

Thanks again,
