Return-Path: X-Original-To: apmail-hadoop-mapreduce-user-archive@minotaur.apache.org Delivered-To: apmail-hadoop-mapreduce-user-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id D8C051091A for ; Wed, 14 Aug 2013 17:07:01 +0000 (UTC) Received: (qmail 78566 invoked by uid 500); 14 Aug 2013 17:06:59 -0000 Delivered-To: apmail-hadoop-mapreduce-user-archive@hadoop.apache.org Received: (qmail 78498 invoked by uid 500); 14 Aug 2013 17:06:52 -0000 Mailing-List: contact mapreduce-user-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: mapreduce-user@hadoop.apache.org Delivered-To: mailing list mapreduce-user@hadoop.apache.org Received: (qmail 78469 invoked by uid 99); 14 Aug 2013 17:06:51 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 14 Aug 2013 17:06:51 +0000 X-ASF-Spam-Status: No, hits=1.7 required=5.0 tests=FREEMAIL_ENVFROM_END_DIGIT,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of lordjoe2000@gmail.com designates 209.85.214.50 as permitted sender) Received: from [209.85.214.50] (HELO mail-bk0-f50.google.com) (209.85.214.50) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 14 Aug 2013 17:06:42 +0000 Received: by mail-bk0-f50.google.com with SMTP id mz11so2913235bkb.9 for ; Wed, 14 Aug 2013 10:06:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:date:message-id:subject:from:to:content-type; bh=B9CCV8MzLyS4/YmK6L4lZbcbF3C5t5e6lXKrfJgKpJw=; b=HhB0VgWhRvNBSeFU8VDG3/A0uOx42XWArqAAGTca4dtfC0xK9xH8eFA4/rLttYKlKa yPEbiI9h53BkB8FSII4fFsnjkpT0d3Lp72G1FWD4J58wpl6HUoKGzDFmUwSvjvMVOzhX 6bO/o/XbXG+cSlb/nTTmapOq8k/DG8DiiVquDmpFmOu3M7J8bGEtgDFUrGSaLCbhyPo8 7CfL8BnBX0WQUpD/hS/qzgmLOQHB7qInD1TxsJQAMj/vIdTDgYU/fVfroZBisud+lr+o NxtCEwTm8RvJ/Z1A7iMEZOxe1jTGhDGqTqshIXgzO5s3QiI18Mx35+LNJRGI8Sg7r6DK pmgw== MIME-Version: 1.0 X-Received: by 10.204.225.12 with SMTP id iq12mr7801622bkb.4.1376499981062; Wed, 14 Aug 2013 10:06:21 -0700 (PDT) Received: by 10.204.18.148 with HTTP; Wed, 14 Aug 2013 10:06:20 -0700 (PDT) Date: Wed, 14 Aug 2013 10:06:20 -0700 Message-ID: Subject: How do I perform a scalable cartesian product From: Steve Lewis To: mapreduce-user , Steve Lewis Content-Type: multipart/alternative; boundary=485b3970d1ca67d34604e3eb60e1 X-Virus-Checked: Checked by ClamAV on apache.org --485b3970d1ca67d34604e3eb60e1 Content-Type: text/plain; charset=ISO-8859-1 I have the problem of performing a operation of a data set on itself. Assume, for example, that I have a list of people and their addresses and for each person I want the ten closest members of the set. (this is not the problem but illustrated critical aspects). I know that the ten closest people will be in the same zipcode or a neighboring zip code. This means unless the database is very large I can have the mapper send every person out with keys representing their zipcode and also keys representing the neighboring zip codes. In the reducer I can keep all people in memory and compute distances between them (assume the distance computation is slightly expensive). The problem is that this approach will not scale - eventually the number of people assigned to a zip code will exceed memory. In the current problem the number of "people" is about 100 million and doubling every 6 months. The size of a "zipcode" requires keeping about 100,000 items in memory - doable today but marginal in terms of future growth. Are there other ways to solve the problem. I considered keeping a random subset, finding the closest in that subset and then repeating with different random subsets. The solution of midifying the splitter to generate all pairs https://github.com/adamjshook/mapreducepatterns/blob/master/MRDP/src/main/java/mrdp/ch5/CartesianProduct.java will not work for a dataset with 100 million items Any bright ideas? --485b3970d1ca67d34604e3eb60e1 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable
=A0 I have the problem of performing a operation of a= data set on itself.

=A0 =A0Assume, for example, t= hat I have a list of people and their addresses and for each person I want = the ten closest members of the set. (this is not the problem but illustrate= d critical aspects). I know that the ten closest people will be in the same= zipcode or a neighboring zip code. This means unless the database is very = large I can have the mapper send every person out with keys representing = =A0their zipcode and also keys representing the neighboring zip codes. In t= he reducer I can keep all people in memory and compute distances between th= em (assume the distance computation is slightly expensive).
=A0 =A0The problem is that this approach will not scale - eventually t= he number of people assigned to a zip code will exceed memory. In the curre= nt problem the number of "people" is about 100 million and doubli= ng every 6 months. The size of a "zipcode" requires keeping about= 100,000 items in memory - doable today but marginal in terms of future gro= wth.
=A0 =A0Are there other ways to solve the problem. I considered keeping= a random subset, finding the closest in that subset and then repeating wit= h different random subsets. The solution of midifying the splitter to gener= ate all pairs=A0https://gith= ub.com/adamjshook/mapreducepatterns/blob/master/MRDP/src/main/java/mrdp/ch5= /CartesianProduct.java=A0will not work for a dataset with 100 million i= tems
=A0 =A0Any bright ideas?


= =A0

--485b3970d1ca67d34604e3eb60e1--