hadoop-common-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Coordination between Mapper tasks
Date Fri, 20 Mar 2009 04:06:02 GMT
Aaron makes lots of sense when he says that there are better ways to do this
lookup without making your mappers depend on each other.

But having a hadoop cluster slam a mysql farm with queries is asking for
trouble (I have tried it).  Hadoop mappers can saturate a mysql database so
*very* thoroughly that it is a thing to behold.

There are lots of other options.  The idea of using Zookeeper to spawn a
special lookup thread on each machine isn't so bad, although I would avoid
RMI like the plague, preferring Thrift or something similar.  Having the
program that launches the map-reduce job also start a lookup cluster isn't a
bad option either (but it isn't as simple as just starting the map-reduce
program).  Another option is to use a lookup system that depends on the file
system cache for memory residency of the lookup table.
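
To make that last option concrete, here is a minimal sketch of a lookup
backed by a memory-mapped local file.  The OS file system cache (rather
than the JVM heap) keeps the hot pages resident, and shares them across all
of the task JVMs on a node.  The fixed-width, key-sorted record layout and
the class name are my assumptions for the sketch, not anything standard:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Lookup table stored as fixed-width, key-sorted records in a local
// file.  The mapped pages live in the OS file system cache, so they
// stay warm across calls without consuming JVM heap.
public class MappedLookup {
  private static final int KEY = 16, VALUE = 16, RECORD = KEY + VALUE;

  private final MappedByteBuffer buf;
  private final int records;

  public MappedLookup(String path) throws IOException {
    RandomAccessFile file = new RandomAccessFile(path, "r");
    buf = file.getChannel().map(
        FileChannel.MapMode.READ_ONLY, 0, file.length());
    records = (int) (file.length() / RECORD);
  }

  // Binary search over the sorted records; returns the value bytes
  // for the key, or null if the key is absent.
  public byte[] get(byte[] key) {
    int lo = 0, hi = records - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      int cmp = compareKeyAt(mid, key);
      if (cmp == 0) {
        byte[] value = new byte[VALUE];
        for (int i = 0; i < VALUE; i++) {
          value[i] = buf.get(mid * RECORD + KEY + i);
        }
        return value;
      }
      if (cmp < 0) lo = mid + 1; else hi = mid - 1;
    }
    return null;
  }

  // Unsigned byte comparison of the key stored at record `rec`
  // against the probe key.
  private int compareKeyAt(int rec, byte[] key) {
    for (int i = 0; i < KEY; i++) {
      int a = buf.get(rec * RECORD + i) & 0xff;
      int b = key[i] & 0xff;
      if (a != b) return a - b;
    }
    return 0;
  }
}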

I would strongly recommend exploring a pure map-reduce solution to the
problem.  Try joining your lookup table to your map data using a preliminary
map-reduce step.  This is very easily done if you have a single lookup per
map invocation.  If you have a number of lookups, then pass through your
data producing lookup keys, each with a pointer back to your original record
key, and pass through your lookup table generating key/value pairs.  Reduce
on lookup key and emit the original record key plus the key/value pair from
the lookup table, making sure you eliminate duplicate key/value pairs at
this point.  Reduce that output against your original data and you have your
original data with all of the lookup records the mapper needs in one place.
You are now set to go with your original problem, except the lookup
operation has been done ahead of time.  (A sketch of the first of these
passes follows below.)

This sounds outrageously expensive, but because all of the disk I/O is
sequential, it can be surprisingly fast even when the intermediate data sets
are quite large.
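
For concreteness, here is a minimal sketch of that first join pass in Java
against the old org.apache.hadoop.mapred API (current as of 0.19).  The
class names and the tab-separated field layouts are assumptions for
illustration, not your actual formats:

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class LookupJoin {

  // Data records, assumed "recordKey <TAB> lookupKey <TAB> ...".
  // Emits (lookupKey, pointer back to the record), tagged "R".
  public static class DataMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable pos, Text line,
        OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String[] f = line.toString().split("\t");
      out.collect(new Text(f[1]), new Text("R\t" + f[0]));
    }
  }

  // Lookup table, assumed "lookupKey <TAB> value".
  // Emits (lookupKey, value), tagged "V".
  public static class LookupMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable pos, Text line,
        OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String[] f = line.toString().split("\t");
      out.collect(new Text(f[0]), new Text("V\t" + f[1]));
    }
  }

  // Joins the two sides on the lookup key and emits one
  // (recordKey, lookupKey <TAB> value) pair per match.  The HashSet
  // is what eliminates duplicate key/value pairs at this point.
  public static class JoinReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text lookupKey, Iterator<Text> values,
        OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      List<String> recordKeys = new ArrayList<String>();
      Set<String> lookupValues = new HashSet<String>();
      while (values.hasNext()) {
        String[] f = values.next().toString().split("\t", 2);
        if ("R".equals(f[0])) recordKeys.add(f[1]);
        else lookupValues.add(f[1]);
      }
      for (String rk : recordKeys)
        for (String v : lookupValues)
          out.collect(new Text(rk), new Text(lookupKey + "\t" + v));
    }
  }
}

Wire the two mappers to their inputs with
org.apache.hadoop.mapred.lib.MultipleInputs (or tag records in a single
mapper), then run the same pattern a second time with this output joined
against the original data.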

On Thu, Mar 19, 2009 at 8:46 PM, Aaron Kimball <aaron@cloudera.com> wrote:

>
> Are you using multiple machines for your processing? Rolling your own RMI
> service to provide data to your other system seems like asking for tricky
> bugs. Why not just put the dictionary terms into a mysql database? Your
> mappers could then select against this database, pulling in data
> incrementally, and discarding data they don't need. If you configure
> memcached (like Jim suggests), then you can even get some memory-based
> performance boosts by sharing common reads.
>
--
Ted Dunning, CTO
DeepDyve
