Mailing-List: contact mapreduce-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: mapreduce-user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: domain of lordjoe2000@gmail.com
 designates 209.85.214.50 as permitted sender)
MIME-Version: 1.0
Date: Wed, 14 Aug 2013 10:06:20 -0700
Message-ID: 
 <CALEj8eMOpSsE-AKjeLSxoH6u9sY7nCTxoJwX_5VeV3NDDuENQQ@mail.gmail.com>
Subject: How do I perform a scalable cartesian product
From: Steve Lewis <lordjoe2000@gmail.com>
To: mapreduce-user <mapreduce-user@hadoop.apache.org>,
 Steve Lewis <lordjoe2000@gmail.com>
Content-Type: multipart/alternative; boundary=485b3970d1ca67d34604e3eb60e1

--485b3970d1ca67d34604e3eb60e1
Content-Type: text/plain; charset=ISO-8859-1

  I have the problem of performing an operation on a data set against itself.

   Assume, for example, that I have a list of people and their addresses,
and for each person I want the ten closest members of the set. (This is not
the actual problem, but it illustrates the critical aspects.) I know that
the ten closest people will be in the same zip code or a neighboring zip
code. This means that, unless the database is very large, I can have the
mapper send every person out with a key for their own zip code and also
keys for the neighboring zip codes. In the reducer I can keep all the
people for that key in memory and compute distances between them (assume
the distance computation is slightly expensive).
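
To make the scheme concrete, here is a minimal sketch in plain Python rather
than actual Hadoop code. The NEIGHBORS table, the record layout, and the toy
distance function are assumptions made up for illustration; the real job
would use its own adjacency data and distance computation.

```python
import heapq
from collections import defaultdict
from math import hypot

# Hypothetical adjacency table: zip code -> neighboring zip codes.
NEIGHBORS = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}

def distance(p, q):
    # Stand-in for the (slightly expensive) real distance computation.
    return hypot(p["x"] - q["x"], p["y"] - q["y"])

def map_person(person):
    """Emit the person under the home zip code and every neighboring one."""
    home = person["zip"]
    yield home, person
    for z in NEIGHBORS.get(home, []):
        yield z, person

def reduce_zip(zipcode, people, k=10):
    """Hold one key's bucket in memory and rank its residents against it."""
    for p in people:
        if p["zip"] != zipcode:
            continue  # replicated copies are candidates only, not queries
        others = (q for q in people if q["id"] != p["id"])
        closest = heapq.nsmallest(k, others, key=lambda q: distance(p, q))
        yield p["id"], [q["id"] for q in closest]

def run(people, k=10):
    buckets = defaultdict(list)  # plays the role of the shuffle
    for person in people:
        for key, value in map_person(person):
            buckets[key].append(value)
    result = {}
    for zipcode, bucket in buckets.items():
        result.update(reduce_zip(zipcode, bucket, k))
    return result
```

Each person is a query in exactly one reducer (the home zip code) and a
candidate in that reducer plus every neighboring one, which is why the whole
bucket has to fit in memory.
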
   The problem is that this approach will not scale: eventually the number
of people assigned to a zip code will exceed memory. In the current problem
the number of "people" is about 100 million and doubling every 6 months.
A single "zipcode" currently requires keeping about 100,000 items in
memory, which is doable today but marginal in terms of future growth.
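
As a rough back-of-the-envelope projection, the snippet below extrapolates
both numbers. It assumes that the per-key bucket grows in proportion to the
total count, which is only an assumption; the function name `projected` is
made up for this sketch.

```python
def projected(count_now, months, doubling_months=6):
    """Extrapolate a count that doubles every `doubling_months` months."""
    return count_now * 2 ** (months / doubling_months)

for months in (0, 12, 24, 36):
    total = projected(100_000_000, months)
    bucket = projected(100_000, months)
    print(f"{months:2d} months: ~{total:,.0f} items, ~{bucket:,.0f} per bucket")
```
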
   Are there other ways to solve the problem? I considered keeping a random
subset, finding the closest in that subset, and then repeating with
different random subsets. The solution of modifying the splitter to
generate all pairs
https://github.com/adamjshook/mapreducepatterns/blob/master/MRDP/src/main/java/mrdp/ch5/CartesianProduct.java
will not work for a dataset with 100 million items.
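
One possible reading of the random-subset idea, sketched in plain Python for
a single query person: shuffle the candidates, score one memory-sized slice
at a time, and carry a running top-k between slices. The function names, the
`people` dictionary, and the `subset_size` parameter are hypothetical. When
the slices are disjoint and together cover all candidates, the merged result
equals the exact top-k; sampling only some of the slices would make it
approximate.

```python
import heapq
import random

def merge_top_k(current, candidates, k):
    """Merge two lists of (distance, id) pairs and keep the k smallest."""
    return heapq.nsmallest(k, current + candidates)

def nearest_by_subsets(query, people, distance, k=10, subset_size=1000, seed=0):
    """Find the k nearest ids to `query`, scoring one subset at a time."""
    ids = [pid for pid in people if pid != query]
    random.Random(seed).shuffle(ids)
    best = []  # running top-k, never longer than k entries
    for start in range(0, len(ids), subset_size):
        subset = ids[start:start + subset_size]  # only this slice is held
        scored = [(distance(people[query], people[pid]), pid) for pid in subset]
        best = merge_top_k(best, scored, k)
    return [pid for _, pid in best]
```

This bounds the candidate side to `subset_size` items at a time, but every
candidate is still scored once per query, so the distance work is unchanged.
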
   Any bright ideas?

--485b3970d1ca67d34604e3eb60e1--