hadoop-mapreduce-user mailing list archives

From Steve Lewis <lordjoe2...@gmail.com>
Subject How do I perform a scalable cartesian product
Date Wed, 14 Aug 2013 17:06:20 GMT
  I have the problem of performing an operation on a data set against itself.

   Assume, for example, that I have a list of people and their addresses,
and for each person I want the ten closest members of the set (this is not
the real problem, but it illustrates the critical aspects). I know that the
ten closest people will be in the same zip code or a neighboring zip code.
This means that unless the database is very large, I can have the mapper
send every person out with keys representing their zip code and also keys
representing the neighboring zip codes. In the reducer I can keep all
people in memory and compute distances between them (assume the distance
computation is slightly expensive).
   The problem is that this approach will not scale - eventually the number
of people assigned to a zip code will exceed memory. In the current problem
the number of "people" is about 100 million and doubling every 6 months.
The size of a "zipcode" requires keeping about 100,000 items in memory -
doable today but marginal in terms of future growth.
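The blocking idea described above can be simulated locally. This is a minimal Python sketch, not Hadoop code: the map phase emits each person under their own zip code and each neighboring zip code, and the reduce phase holds one block in memory and compares all pairs within it. The records, coordinates, and the `NEIGHBORING_ZIPS` table are illustrative assumptions.

```python
import math
from collections import defaultdict

# Hypothetical toy records: (person_id, zip_code, x, y).
# Coordinates stand in for real geolocation; the names are illustrative.
PEOPLE = [
    ("alice", "98101", 0.0, 0.0),
    ("bob",   "98101", 1.0, 0.0),
    ("carol", "98102", 0.5, 1.0),
    ("dave",  "98103", 5.0, 5.0),
]

# Hypothetical adjacency table; in practice this would come from GIS data.
NEIGHBORING_ZIPS = {
    "98101": ["98102"],
    "98102": ["98101", "98103"],
    "98103": ["98102"],
}

def map_phase(person):
    """Emit the person keyed by their own zip and every neighboring zip."""
    pid, zip_code, x, y = person
    yield zip_code, person
    for z in NEIGHBORING_ZIPS.get(zip_code, []):
        yield z, person

def reduce_phase(zip_code, people, k=10):
    """Hold one zip code's worth of people in memory; for each person,
    return the ids of the k closest other people in the block."""
    results = {}
    for pid, _, x, y in people:
        dists = sorted(
            (math.hypot(x - ox, y - oy), other_id)
            for other_id, _, ox, oy in people
            if other_id != pid
        )
        results[pid] = [other_id for _, other_id in dists[:k]]
    return results

# Drive the simulated job: group mapper output by key, then reduce.
groups = defaultdict(list)
for person in PEOPLE:
    for key, value in map_phase(person):
        groups[key].append(value)

# A person lands in several blocks, so their candidate sets are unioned;
# a final pass would keep only the k best overall.
candidates = defaultdict(set)
for zip_code, people in groups.items():
    for pid, neighbors in reduce_phase(zip_code, people).items():
        candidates[pid].update(neighbors)

print(sorted(candidates["alice"]))  # → ['bob', 'carol', 'dave']
```

The memory pressure described above shows up here as the size of one `people` list inside `reduce_phase`; everything else streams.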
   Are there other ways to solve the problem? I considered keeping a
random subset, finding the closest in that subset, and then repeating with
different random subsets. The alternative of modifying the splitter to
generate all pairs will not work for a dataset with 100 million items.
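The random-subset idea can be made concrete as an approximate k-nearest-neighbor refinement: each round, every point is compared against a fresh random sample, and only the k closest matches seen so far are retained, so memory stays bounded. This is a local sketch under assumed names (`refine_neighbors`, the point layout), not a MapReduce implementation.

```python
import math
import random

def refine_neighbors(points, k=3, subset_size=4, rounds=3, seed=0):
    """points: {point_id: (x, y)}. Repeatedly compare every point against
    a random subset, keeping at most the k closest matches seen so far.
    Returns {point_id: [neighbor ids, closest first]} (approximate)."""
    rng = random.Random(seed)
    ids = list(points)
    best = {pid: {} for pid in ids}  # other_id -> distance seen so far
    for _ in range(rounds):
        sample = rng.sample(ids, min(subset_size, len(ids)))
        for pid, (px, py) in points.items():
            for other in sample:
                if other != pid:
                    ox, oy = points[other]
                    best[pid][other] = math.hypot(px - ox, py - oy)
            # Trim to the k closest seen so far to bound memory per point.
            best[pid] = dict(sorted(best[pid].items(),
                                    key=lambda kv: kv[1])[:k])
    return {pid: sorted(best[pid], key=best[pid].get) for pid in ids}

# Toy usage: four points on a line. With subset_size >= number of points,
# every pair is compared, so the result is exact rather than approximate.
points = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (2.0, 0.0), "d": (10.0, 0.0)}
print(refine_neighbors(points, k=2)["a"])  # → ['b', 'c']
```

With large data the quality depends on how many rounds each point gets a useful sample, so this trades accuracy for bounded memory rather than guaranteeing the true ten closest.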
   Any bright ideas?
