hadoop-common-user mailing list archives

From "Black, Michael (IS)" <Michael.Bla...@ngc.com>
Subject RE: EXTERNAL:Re: Import data from mysql
Date Sun, 09 Jan 2011 13:51:18 GMT
All you're doing is delaying the inevitable by going to Hadoop.  There's no magic to Hadoop.
It doesn't run as fast as individual processes.  There's just the ability to split jobs across
a cluster, which works for some problems.  You won't even get a linear improvement in speed.
At least I assume you don't have some magical-automatically-growing-forest-of-computers.
Do ALL the values change every day?  You would still be better off doing it as updates
are made.  You can multithread your application with OpenMP really easily, and if you've got
8 cores, get close to an 8X improvement with hardly any effort at all.
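
A minimal sketch of the OpenMP suggestion, assuming the rows sit in memory as a row-major array of doubles. The scoring function pair_score is hypothetical (a plain dot product), since the thread never shows the real calculation, and total_pair_score is likewise an illustrative name:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the real per-pair score, which the thread
 * does not show: here it is just a dot product of two rows. */
double pair_score(const double *a, const double *b, size_t ncols)
{
    double s = 0.0;
    for (size_t k = 0; k < ncols; k++)
        s += a[k] * b[k];
    return s;
}

/* Sum the score over every unordered pair of rows (i < j).
 * rows is an n x ncols matrix stored row by row. */
double total_pair_score(const double *rows, long n, size_t ncols)
{
    assert(rows != NULL || n <= 0);
    double total = 0.0;

    /* Outer iterations are independent, so they can be shared among
     * threads. reduction(+:total) gives each thread a private partial
     * sum; schedule(dynamic) helps because later values of i have
     * fewer partners j left to visit. */
    #pragma omp parallel for schedule(dynamic) reduction(+:total)
    for (long i = 0; i < n; i++)
        for (long j = i + 1; j < n; j++)
            total += pair_score(rows + i * ncols, rows + j * ncols, ncols);

    return total;
}
```

With GCC this is built with -fopenmp; without that flag the pragma is ignored and the same code runs serially. How close it gets to 8X on 8 cores depends on the real scoring function and on memory traffic.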
It sounds like you have an exploding data problem, which means you need to readdress what you're
doing so you're not in N^2 space any more.  That's completely untenable, as you're starting
to see.  You quite obviously cannot keep this up for long...
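
To put rough numbers on the N^2 point and on recomputing only as updates are made, here is a small sketch that counts pairs. The function names and the figures in the note below are hypothetical, and it assumes a score depends only on the two rows involved:

```c
#include <assert.h>

/* Number of unordered pairs among n rows: n(n-1)/2. */
unsigned long long pairs_full(unsigned long long n)
{
    return n < 2 ? 0 : n * (n - 1) / 2;
}

/* Pairs that involve at least one of k changed rows (k <= n):
 * changed-vs-unchanged pairs plus changed-vs-changed pairs. */
unsigned long long pairs_touched(unsigned long long n, unsigned long long k)
{
    assert(k <= n);
    return k * (n - k) + pairs_full(k);
}
```

For a hypothetical 100,000 rows, a full daily run scores 4,999,950,000 pairs, and doubling the table to 200,000 rows roughly quadruples that to 19,999,900,000. If only 1,000 rows changed, 99,499,500 pairs (about 2%) would need rescoring. If every row really does change every day, k equals n and nothing is saved.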
So...if you want to open up your kimono a bit and show an example of what you're doing, maybe
we can help.
Michael D. Black
Senior Scientist
Advanced Analytics Directorate
Northrop Grumman Information Systems


From: Brian McSweeney [mailto:brian.mcsweeney@gmail.com]
Sent: Sun 1/9/2011 7:30 AM
To: common-user@hadoop.apache.org
Subject: EXTERNAL:Re: Import data from mysql

Hi Michael,

Yeah, sorry, I shouldn't have said a compare, as that would be a simplified
problem. For each pair of rows I have to calculate a score based on multiplying
some of the column values together, running some functions against each
other, etc. I could do this as the rows are entered into the db, cutting down
the problem; however, unfortunately the values in the existing rows change
every day, so I think the only thing to do is export the lot and run
a job once a day to come up with the new scores. This is why I'm looking at
Hadoop, as it has become too big a job to do with serial processing.

