Subject: Re: Reading from HDFS from inside the mapper
From: Sigurd Spieckermann <sigurd.spieckermann@gmail.com>
To: user@hadoop.apache.org
Date: Mon, 10 Sep 2012 12:45:38 +0200

I checked DistributedCache, but in general I have to assume that none of the datasets fits in memory... That's why I was considering a map-side join, but by default it doesn't fit my problem. I could probably get it to work, but I would have to enforce the requirements of the map-side join.

2012/9/10 Hemanth Yamijala <yhemanth@thoughtworks.com>

> Hi,
>
> You could check DistributedCache (
> http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache).
> It would allow you to distribute data to the nodes where your tasks are run.
>
> Thanks
> Hemanth
>
> On Mon, Sep 10, 2012 at 3:27 PM, Sigurd Spieckermann <sigurd.spieckermann@gmail.com> wrote:
>
>> Hi,
>>
>> I would like to perform a map-side join of two large datasets where
>> dataset A consists of m*n elements and dataset B consists of n elements.
>> For the join, every element in dataset B needs to be accessed m times.
>> Each mapper would join one element from A with the corresponding element
>> from B. Elements here are actually data blocks. Is there a performance
>> problem (and a difference compared to a slightly modified map-side join
>> using the join package) if I set dataset A as the map-reduce input and
>> load the relevant element from dataset B directly from HDFS inside the
>> mapper?
>> I could store the elements of B in a MapFile for faster random access.
>> In the second case, without the join package, I would not have to
>> partition the datasets manually, which would allow a bit more
>> flexibility, but I'm wondering if HDFS access from inside a mapper is
>> strictly bad. Also, does Hadoop have a cache for such situations, by any
>> chance?
>>
>> I appreciate any comments!
>>
>> Sigurd
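[Editor's note: the MapFile lookup described above is essentially a binary search over sorted keys (a MapFile is a sorted SequenceFile plus a sparse key index). A rough sketch of the per-record access pattern, outside Hadoop and with made-up toy data:]

```python
import bisect

# Toy stand-in for dataset B: n blocks, keyed and sorted --
# the ordering a MapFile guarantees on disk.
b_keys = [10, 20, 30, 40]
b_vals = ["block-10", "block-20", "block-30", "block-40"]

def lookup_b(key):
    """Binary search over the sorted keys, roughly what MapFile.Reader.get() does."""
    i = bisect.bisect_left(b_keys, key)
    if i < len(b_keys) and b_keys[i] == key:
        return b_vals[i]
    return None

# Toy stand-in for dataset A: m*n records; each map() call joins one
# A-record with its matching B-block.
a_records = [(20, "a-1"), (40, "a-2"), (20, "a-3")]
joined = [(k, v, lookup_b(k)) for k, v in a_records]
print(joined)  # [(20, 'a-1', 'block-20'), (40, 'a-2', 'block-40'), (20, 'a-3', 'block-20')]
```

[In the real job each lookup stands in for an HDFS read, so the per-record cost is a remote seek unless the MapFile blocks happen to be local to the mapper.]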
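[Editor's note: the map-side join requirement mentioned in the thread is that the join package (CompositeInputFormat) expects all inputs to be identically partitioned and sorted by key, so that the join reduces to a merge of co-sorted streams. A minimal sketch of that merge, in plain Python with toy data:]

```python
def merge_join(a, b):
    """Merge-join two streams of (key, value) pairs sorted by key.

    Yields (key, a_val, b_val) for matching keys -- the shape of work a
    map-side join does within one partition.
    """
    it_b = iter(b)
    bk, bv = next(it_b, (None, None))
    for ak, av in a:
        # Advance B until its key catches up with A's key.
        while bk is not None and bk < ak:
            bk, bv = next(it_b, (None, None))
        if bk == ak:
            yield ak, av, bv

a = [(1, "a1"), (2, "a2"), (2, "a3"), (5, "a4")]
b = [(1, "b1"), (2, "b2"), (4, "b4"), (5, "b5")]
print(list(merge_join(a, b)))  # [(1, 'a1', 'b1'), (2, 'a2', 'b2'), (2, 'a3', 'b2'), (5, 'a4', 'b5')]
```

[If either input is not pre-partitioned and pre-sorted this way, that is exactly the requirement one would have to enforce before using the join package.]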