From: Yunming Zhang <zhangyunming1990@gmail.com>
To: user@hadoop.apache.org
Subject: Can anyone point me to a good Map Reduce in memory Join implementation?
Date: Fri, 15 Feb 2013 15:25:06 -0600

Hi,

I am trying to do some work with an in-memory join MapReduce implementation. It can be summarized as a join between two data sets, R and S, where one of them is too large to fit in memory and the other fits in memory reasonably well (size of R << size of S).
The typical implementation:

1) distributes or broadcasts R to all map tasks (each mapper loads R in memory, hashed by join key),
2) maps (streams) over S, dividing S into records that are fed as input to each map task,
3) within each map task, looks up the join key in R for every tuple of S,
4) leaves the reduce computation trivial.

(A rough sketch of what I have in mind is appended after my signature.)

If anyone could point me to a good implementation that I could use as a reference, that would be great. I do plan to write my own implementation, but it would be helpful to take a look and see whether there are established implementations out there.

Thanks,
Yunming
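P.S. For concreteness, here is a minimal sketch of the kind of map-side hash join I have in mind, written against the Hadoop mapreduce API. The config property "join.r.path", the tab-separated record layout, and the class/field names are just my assumptions for illustration, not taken from any established implementation. The small relation R is read from HDFS in setup() and hashed by join key; each map() call then probes that hash map with one tuple of S.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side (broadcast) hash join: R is small enough to hold in memory,
// while S is streamed through map() one tuple at a time.
public class MemoryJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    // In-memory hash of the small relation R, keyed by join key.
    private final Map<String, String> rByKey = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        // "join.r.path" is a made-up property naming R's location on HDFS;
        // shipping R via the distributed cache would be the more usual approach.
        Path rPath = new Path(conf.get("join.r.path"));
        FileSystem fs = rPath.getFileSystem(conf);
        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(rPath)));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // Assume tab-separated records whose first field is the join key.
                String[] fields = line.split("\t", 2);
                rByKey.put(fields[0], fields.length > 1 ? fields[1] : "");
            }
        } finally {
            reader.close();
        }
    }

    @Override
    protected void map(LongWritable offset, Text sTuple, Context context)
            throws IOException, InterruptedException {
        // One tuple of the large relation S, probed against the in-memory hash of R.
        String[] fields = sTuple.toString().split("\t", 2);
        String rValue = rByKey.get(fields[0]);
        if (rValue != null) {
            // Emit the joined tuple; the reduce side can be the identity,
            // or the job can be configured with zero reducers.
            String sValue = fields.length > 1 ? fields[1] : "";
            context.write(new Text(fields[0]), new Text(sValue + "\t" + rValue));
        }
    }
}

The driver would set "join.r.path" in the Configuration (or ship R through the distributed cache instead) and could set the number of reduce tasks to zero, since the mapper already emits joined tuples.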