##### Site index · List index
Message view
Top
From Ted Dunning <tdunn...@maprtech.com>
Subject Re: Import data from mysql
Date Sun, 09 Jan 2011 21:18:07 GMT
```You still have to knock down the quadratic cost.

Any equality checks you have in your problem can be used to limit the
problem to growing quadratically in the number of records equal by that
comparison.  That may be enough to fix things (for now).  Unfortunately
heavily skewed data are very common so this smaller quadratic will be orders
of magnitude smaller than the original, but still unscalable.  Hadoop makes
this grouping by equality much easier of course and the internal scan can be
done by conventional techniques.

Beyond that, you need to look at more interesting techniques to really make
this a viable option.

I would recommend:

- if the multiplication is part of a cosine similarity measurement, then
look at expressing it as a difference instead and bound the largest
component of the different.

- take a look at locality sensitive hashing.  This gives you an approximate
nearest neighbor solution that will allow good probabilistic bounds on the
number of cases that you miss in return of a scalable solution.  The error
bounds can be made fairly tight.  See http://www.mit.edu/~andoni/LSH/

- if you decide that LSH is the way to go, check out Mahout which has a
minhash clustering implementation.

- if you can't restate the problem as non-quadratic, then start over.
Quadratic algorithms are not scalable as Michael Black has stated

- consider tell the group more about your problem.  You get more if you give
more.

On Sun, Jan 9, 2011 at 5:26 AM, Brian McSweeney
<brian.mcsweeney@gmail.com>wrote:

> Thanks Ted,
>
> You're right but I suppose I was too brief in my initial statement. I
> should
> have said that I have to run an operation on all rows with respect to each
> other. It's not a case of just comparing them and thus sorting them so
> unfortunately I don't think this will help much. Some of the values in the
> rows have to be multiplied together, some have to be compared, some have to
> have a function run against them etc.
>
> cheers,
> Brian
>
> On Sun, Jan 9, 2011 at 8:55 AM, Ted Dunning <tdunning@maprtech.com> wrote:
>
> > It is, of course, only quadratic, even if you compare all rows to all
> other
> > rows.  You can reduce this cost to O(n log n) by ordinary sorting and you
> > can reduce further reduce the cost to O(n) using radix sort on hashes.
> >
> > Practically speaking, in either the parallel or non parallel setting try
> > sorting each batch of inputs and then doing a merge pass to find
> duplicated
> > rows.  Hashing your rows and doing the sort will make things faster if
> your
> > rows are very long or if you use radix sort.  Unless your data is vast,
> > this
> > would probably work on a single machine with no need for parallelism
> since
> > sorting billions of items would require <10 passes through your data with
> a
> > 2^16 way radix sort.
> >
> > To do this with hadoop, simply do the hashing as before and run a typical
> > word count.  Then the rows that duplicate are simply the ones with count
> >
> > 1
> > and these can be preferentially output by the reducer.
> >
> > On Sat, Jan 8, 2011 at 3:33 PM, Brian McSweeney
> > <brian.mcsweeney@gmail.com>wrote:
> >
> > > I'm a TOTAL newbie on hadoop. I have an existing webapp that has a
> > growing
> > > number of rows in a mysql database that I have to compare against one
> > > another once a day from a batch job. This is an exponential problem as
> > > every
> > > row must be compared against every other row. I was thinking of
> > > parallelizing this computation via hadoop.
> > >
> >
>
>
>
> --
> -----------------------------------------
> Brian McSweeney
>
> Technology Director
> Smarter Technology
> web: http://www.smarter.ie
> phone: +353868578212
> -----------------------------------------
>

```
Mime
• Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message