Subject: Re: Reading from HDFS from inside the mapper
From: Sigurd Spieckermann <sigurd.spieckermann@gmail.com>
To: user@hadoop.apache.org
Date: Mon, 10 Sep 2012 12:45:38 +0200

I checked DistributedCache, but in general I have to assume that none of the datasets fits in memory... That's why I was considering a map-side join, but by default it doesn't fit my problem. I could probably get it to work, but I would have to enforce the requirements of the map-side join.

2012/9/10 Hemanth Yamijala <yhemanth@thoughtworks.com>

> Hi,
>
> You could check DistributedCache (
> http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache).
> It would allow you to distribute data to the nodes where your tasks are run.
>
> Thanks
> Hemanth
>
> On Mon, Sep 10, 2012 at 3:27 PM, Sigurd Spieckermann <sigurd.spieckermann@gmail.com> wrote:
>
>> Hi,
>>
>> I would like to perform a map-side join of two large datasets where
>> dataset A consists of m*n elements and dataset B consists of n elements.
>> For the join, every element in dataset B needs to be accessed m times.
>> Each mapper would join one element from A with the corresponding element
>> from B. Elements here are actually data blocks. Is there a performance
>> problem (and a difference compared to a slightly modified map-side join
>> using the join package) if I set dataset A as the map-reduce input and
>> load the relevant element from dataset B directly from HDFS inside the
>> mapper?
>> I could store the elements of B in a MapFile for faster random access.
>> In the second case, without the join package, I would not have to
>> partition the datasets manually, which would allow a bit more
>> flexibility, but I'm wondering if HDFS access from inside a mapper is
>> strictly bad. Also, does Hadoop have a cache for such situations, by any
>> chance?
>>
>> I appreciate any comments!
>>
>> Sigurd
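[Editor's note: the MapFile lookup described above is essentially a binary search over sorted keys (a MapFile is a sorted SequenceFile plus a sparse key index). A rough sketch of the per-record access pattern, outside Hadoop and with made-up toy data:]

```python
import bisect

# Toy stand-in for dataset B: n blocks, keyed and sorted --
# the ordering a MapFile guarantees on disk.
b_keys = [10, 20, 30, 40]
b_vals = ["block-10", "block-20", "block-30", "block-40"]

def lookup_b(key):
    """Binary search over the sorted keys, roughly what MapFile.Reader.get() does."""
    i = bisect.bisect_left(b_keys, key)
    if i < len(b_keys) and b_keys[i] == key:
        return b_vals[i]
    return None

# Toy stand-in for dataset A: m*n records; each map() call joins one
# A-record with its matching B-block.
a_records = [(20, "a-1"), (40, "a-2"), (20, "a-3")]
joined = [(k, v, lookup_b(k)) for k, v in a_records]
print(joined)  # [(20, 'a-1', 'block-20'), (40, 'a-2', 'block-40'), (20, 'a-3', 'block-20')]
```

[In the real job each lookup stands in for an HDFS read, so the per-record cost is a remote seek unless the MapFile blocks happen to be local to the mapper.]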
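[Editor's note: the map-side join requirement mentioned in the thread is that the join package (CompositeInputFormat) expects all inputs to be identically partitioned and sorted by key, so that the join reduces to a merge of co-sorted streams. A minimal sketch of that merge, in plain Python with toy data:]

```python
def merge_join(a, b):
    """Merge-join two streams of (key, value) pairs sorted by key.

    Yields (key, a_val, b_val) for matching keys -- the shape of work a
    map-side join does within one partition.
    """
    it_b = iter(b)
    bk, bv = next(it_b, (None, None))
    for ak, av in a:
        # Advance B until its key catches up with A's key.
        while bk is not None and bk < ak:
            bk, bv = next(it_b, (None, None))
        if bk == ak:
            yield ak, av, bv

a = [(1, "a1"), (2, "a2"), (2, "a3"), (5, "a4")]
b = [(1, "b1"), (2, "b2"), (4, "b4"), (5, "b5")]
print(list(merge_join(a, b)))  # [(1, 'a1', 'b1'), (2, 'a2', 'b2'), (2, 'a3', 'b2'), (5, 'a4', 'b5')]
```

[If either input is not pre-partitioned and pre-sorted this way, that is exactly the requirement one would have to enforce before using the join package.]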