hadoop-common-user mailing list archives

From Brian McSweeney <brian.mcswee...@gmail.com>
Subject Re: Import data from mysql
Date Mon, 10 Jan 2011 01:19:28 GMT
Hi Michael,

Firstly, thanks for the reply. Secondly, I have to give you credit for being
the first person who has ever asked me if I want to open up my kimono a
little, and also the first person on a tech list who has ever made me laugh
out loud. :)

Ok, I hear you, and you raise some very valid issues so I'll show a little
leg :)

So, my application is a dating application, and the row comparison I was
referring to is between users in the system. Each user has a set of profile
attributes (age, location, gender, race, religion, etc.).
Each user also has a set of preferences in terms of ideal dates. To determine
whether two people are a good fit for each other, each user's preferences are
compared against the other user's profile attributes, and a score is produced.
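To make the scoring concrete, here's a rough sketch. The attribute names, the weights, and the exact-match rule are made up purely for illustration; the real scoring multiplies column values and runs functions against each other, so treat this as the shape of the idea, not the actual formula:

```python
# Hypothetical sketch: how well user B's profile satisfies user A's
# preferences. Attribute names, weights, and the exact-match rule are
# illustrative only; the real scoring is more involved.
def match_score(prefs, profile, weights=None):
    """Weighted fraction of A's preferences that B's profile satisfies."""
    weights = weights or {k: 1.0 for k in prefs}
    total = sum(weights[k] for k in prefs)
    if total == 0:
        return 0.0
    hit = sum(weights[k] for k, wanted in prefs.items()
              if profile.get(k) == wanted)
    return hit / total
```

A symmetric fit for a pair would then combine match_score(a_prefs, b_profile) with match_score(b_prefs, a_profile) in some way.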

In an ideal world, each user would create a score for each other user. This
would be, as you have pointed out, an N^2 problem. Also, there is a Bayesian
factor that is applied to every user on a daily basis, based on a number of
factors such as activity on the site. This is why I said that all users must
be compared on a daily basis.
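To put numbers on the N^2 point (the user counts here are just examples):

```python
# Unordered pairwise comparisons among n users: n * (n - 1) / 2.
def pair_count(n):
    return n * (n - 1) // 2

# 1,000 users -> 499,500 comparisons; 10,000 users -> ~50 million, every day.
```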

So, where am I with this at the moment? I have realised that this is an
unrealistic strategy long term as the numbers grow, so I have looked at
partitioning the space so that only groups of users under a certain size are
compared: e.g. users in the same state, or capped at some maximum limit.

Thus, I was hoping that if I put that limit at, say, 1000 users (say 1000 men
and 1000 women, i.e. 1,000,000 pairwise scores per partition), then I could
push each of these partitions to Hadoop, where they could run in parallel and
therefore finish quicker than running several batch comparisons of 1000 users
sequentially on one box.
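The partitioning strategy could be sketched roughly like this (assuming a "state" field on each user record as the bucket key; in Hadoop terms the bucket key would become the map output key and the pairwise loop would run in each reducer). This is a sketch of the strategy, not the actual job:

```python
from collections import defaultdict
from itertools import combinations

def bucketed_pairs(users, key="state"):
    """Group users by a partition key and yield only within-bucket pairs."""
    buckets = defaultdict(list)
    for u in users:
        buckets[u[key]].append(u)
    for group in buckets.values():
        # each bucket is independent, so buckets can be scored in parallel
        yield from combinations(group, 2)
```

With 2,000 users split evenly into two states this is 2 x C(1000, 2) = 999,000 pairs instead of C(2000, 2) = 1,999,000, and each bucket can go to a separate worker.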

I hope this makes sense, and that I've opened up my kimono enough for you
to get a sense of what I'm talking about :)

thanks very much,

On Sun, Jan 9, 2011 at 1:51 PM, Black, Michael (IS) wrote:

> All you're doing is delaying the inevitable by going to Hadoop.  There's no
> magic to Hadoop.  It doesn't run as fast as individual processes.  There's
> just the ability to split jobs across a cluster which works for some
> problems.  You won't even get a linear improvement in speed.
> At least I assume you don't have some
> magical-automatically-growing-forest-of-computers.
> Do ALL the values change every day?  You still would be better off doing
> it as updates are made.  You can multithread your application with OpenMP
> really easily and if you've got 8 cores get close to an 8X improvement with
> hardly any effort at all.
> It sounds like you have an exploding data problem which means you need to
> readdress what you're doing so you're not in N^2 space any more.  That's
> completely untenable, which you're starting to see.  You quite obviously
> cannot keep this up for long...
> So...if you want to open up your kimono a bit and show an example of what
> you're doing maybe we can help.
> Michael D. Black
> Senior Scientist
> Advanced Analytics Directorate
> Northrop Grumman Information Systems
> ________________________________
> From: Brian McSweeney [mailto:brian.mcsweeney@gmail.com]
> Sent: Sun 1/9/2011 7:30 AM
> To: common-user@hadoop.apache.org
> Subject: EXTERNAL:Re: Import data from mysql
> Hi Michael,
> yeah, sorry, I shouldn't have said a compare as that would be a simplified
> problem. For each two rows I have to calculate a score based on multiplying
> some of the column values together, running some functions against each
> other etc. I could do this as the rows are entered into the db, cutting
> down
> the problem; unfortunately, however, the values in the existing rows change
> every day, so I think the only thing to do is export the lot and run
> a job once a day to come up with the new scores. This is why I'm looking at
> Hadoop, as it has become too big a job to do in a serial fashion.
> cheers,
> Brian

Brian McSweeney

Technology Director
Smarter Technology
web: http://www.smarter.ie
phone: +353868578212
