hadoop-common-user mailing list archives

From Brian McSweeney <brian.mcswee...@gmail.com>
Subject Re: Import data from mysql
Date Fri, 14 Jan 2011 20:24:12 GMT
Hi Mark,

What a very interesting email! And it sounds like you are writing a very
interesting and timely book. I'm glad you enjoyed the thread. I did too :-)

I would be happy to help with your book however I can, and I'd be fascinated
to read the chapter related to my initial question. As for my problem: it
actually has nothing to do with the insurance site I have. Smarter.ie is an
Irish-focused insurance auction site, and we have other similar sites in
progress for other locations. The matching problem, however, is unrelated to
our insurance sites and was broadly as I described. I did withhold some of
the details, which may have made it slightly confusing, but that is because
there are some commercially sensitive issues involved.

I would be happy for you to use it as an example in your book, but to go
further I think we should discuss it off the mailing list, as there is some
commercially sensitive material in the example, and if it were to be used in
your book I would want it generalized. But yes, you are on the money with
regard to your graph idea.

Anyway, feel free to mail me directly at my gmail address and I'd be very
happy to help all I can.

Kind regards, and best of luck with the book!

On Fri, Jan 14, 2011 at 6:02 AM, Mark Kerzner <markkerzner@gmail.com> wrote:

> Brian,
> I read with fascination your thread on MySQL and Hadoop. I enjoyed your
> polite answers to every person. Your problem is interesting. Your helpers
> were brilliant. Disclaimer: I have a vested interest, as I am writing
> "Hadoop in Practice" for Manning, and I was at the beginning of chapter 3,
> "SQL Databases and Hadoop" when you asked your question. You can imagine
> that I was thrilled and stored the thread to be read later. Which is now.
> I think that your problem has two different components.
>   1. Import of MySQL data into Hadoop. This can be done with Sqoop, HIHO,
>   custom file formats on top of the Hadoop API, Cascading, or
>   cascading-dbmigrate. I imagine that you would dump the files in text
>   format for Hadoop into
>   2. Changing and enhancing the architecture, using
>   update-only-what-changed, data grouping, or some other clever heuristics.
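
For the first component, a minimal Sqoop invocation gives a feel for the
text-format dump being discussed. This is only a sketch: the host, database,
table name, credentials, and target directory below are all placeholders,
not details from the thread.

```shell
# Dump a MySQL table to plain-text files in HDFS with Sqoop.
# All connection details below are hypothetical placeholders.
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username dbuser -P \
  --table rows_to_match \
  --as-textfile \
  --target-dir /data/rows_to_match \
  --num-mappers 4
```

The resulting comma-separated text files under the target directory can then
be read directly as input by a MapReduce job.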
> I am thinking about both questions. For 1., I am planning to look at every
> one of them, then prepare a section with an example of each, because that is
> how the whole book is constructed. For 2., I am thinking about other
> approaches. Essentially, you have a big matrix, and you want to compute
> something similar to matrix multiplication. If so, can you normalize the
> matrix beforehand? Or can you express it as an optimization problem: "I am
> trying to find the maximum number of best matches, according to some
> criteria, and do it in a reasonable time"? I would not be very happy to
> change the algorithm just for the purpose of optimizing speed. At the very
> least, it should not be done on the first iteration, as that would be a case
> of premature optimization. I also wonder if graph operations, something like
> Pregel (Hama), could be useful here.
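
The "data grouping" heuristic mentioned above can be sketched in a few lines
of plain Python, independent of Hadoop: if each row can be assigned a
blocking key, only pairs within the same block need to be compared, which
cuts down the all-pairs O(n^2) work. The rows and the blocking rule here are
invented purely for illustration.

```python
from collections import defaultdict
from itertools import combinations

def blocked_pairs(rows, key):
    """Group rows by a blocking key, then yield only within-block pairs.

    Instead of comparing all n*(n-1)/2 pairs, we compare pairs only inside
    each block -- the classic blocking heuristic for record matching.
    """
    blocks = defaultdict(list)
    for row in rows:
        blocks[key(row)].append(row)
    for block in blocks.values():
        yield from combinations(block, 2)

# Hypothetical rows (id, city); block on city so only same-city rows meet.
rows = [(1, "Dublin"), (2, "Cork"), (3, "Dublin"), (4, "Cork"), (5, "Cork")]
pairs = list(blocked_pairs(rows, key=lambda r: r[1]))
# All 5 rows would give 10 unordered pairs; blocking leaves only 4.
```

In a MapReduce setting the same idea maps naturally: emit the blocking key
in the map phase, and compare pairs within each reducer's group.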
> On the subject of opening the kimono: your site, http://www.smarter.ie/, is
> about auctioning car insurance, and perhaps other types of insurance. Is it
> only for Europe? The site uses the Irish domain name, .ie. Also, is your
> real problem in insurance matching, and did you just use dating as a
> metaphor? Why would I ask? I see in this a wonderful practical application
> example - and nothing beats practice - so I would like to describe it as a
> practical use case, in some general terms. Thus, I would like to know, so
> that I can be closer to reality.
> Thank you. Sincerely,
> Mark
> On Sat, Jan 8, 2011 at 5:33 PM, Brian McSweeney
> <brian.mcsweeney@gmail.com>wrote:
> > Hi folks,
> >
> > I'm a TOTAL newbie on hadoop. I have an existing webapp with a growing
> > number of rows in a mysql database that I have to compare against one
> > another once a day from a batch job. This is a quadratic problem, as
> > every row must be compared against every other row. I was thinking of
> > parallelizing this computation via hadoop. As such, I was thinking that
> > perhaps the first thing to look at is how to bring info from a database
> > to a hadoop job and vice versa. I have seen the following relevant info
> >
> > https://issues.apache.org/jira/browse/HADOOP-2536
> >
> > and also
> >
> > http://architects.dzone.com/articles/tools-moving-sql-database
> >
> > any advice on what approach to use?
> >
> > cheers,
> > Brian
> >

Brian McSweeney

Technology Director
Smarter Technology
web: http://www.smarter.ie
phone: +353868578212
