From: Harsh J
Date: Mon, 24 Aug 2015 08:07:12 +0000
Subject: Re: MultithreadedMapper - Sharing Data Structure
To: user@hadoop.apache.org

The MultithreadedMapper won't solve your problem, as all it does is run
parallel maps within the same map task JVM as a non-multithreaded one. Your
data structure won't be shared across the different map task JVMs on the
host, only within a single map task's own threads running the map() function
over input records.

Wouldn't a reduce-side join be much faster for the larger files?

On Sun, Aug 23, 2015 at 5:08 AM Pedro Magalhaes <pedrorjbr@gmail.com> wrote:
> I am developing a job that has 30B records in the input path (File A).
> I need to filter these records using another file that can have 30K to
> 180M records (File B). So for each record in File A, I make a lookup in
> File B.
> I am using the distributed cache to share File B. The problem is that if
> File B is too large (for example, 180M records), I spend too much CPU
> time loading it into a HashMap, and this load happens in every map task.
>
> In Hadoop 2.x, JVM reuse was discontinued. So I am thinking of using
> MultithreadedMapper, making the HashMap thread-safe and sharing this
> read-only structure across the mappers.
>
> Is this a good approach?
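For concreteness, this is roughly how MultithreadedMapper gets wired into a
job, if you do go that way. A minimal sketch against the Hadoop 2.x
mapreduce API; FilterMapper (sketched further below), the job name, and the
path arguments are all hypothetical:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FilterJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "fileA-filter");
        job.setJarByClass(FilterJobDriver.class);

        // The framework runs MultithreadedMapper, which starts a pool of
        // threads inside the single task JVM; each thread drives its own
        // instance of the real mapper class over a share of the input.
        job.setMapperClass(MultithreadedMapper.class);
        MultithreadedMapper.setMapperClass(job, FilterMapper.class);
        MultithreadedMapper.setNumberOfThreads(job, 8);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // File A
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.addCacheFile(new URI(args[2] + "#fileB"));          // File B, symlinked as "fileB"

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}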
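The one thing the extra threads do buy you is amortizing the File B load
across the threads of a single task JVM, provided the structure lives in a
static field. A sketch of that mapper (hypothetical names; the double-checked
locking assumes the set is never mutated after setup()):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    // Each mapper thread gets its own FilterMapper instance, but statics
    // are per-JVM, so File B is parsed once per task rather than once per
    // thread. The volatile read/write pair publishes the fully built set.
    private static volatile Set<String> fileBKeys;

    @Override
    protected void setup(Context context) throws IOException {
        if (fileBKeys == null) {
            synchronized (FilterMapper.class) {
                if (fileBKeys == null) {
                    Set<String> keys = new HashSet<>();
                    // "fileB" is the symlink registered via addCacheFile(...#fileB)
                    try (BufferedReader reader =
                             new BufferedReader(new FileReader("fileB"))) {
                        String line;
                        while ((line = reader.readLine()) != null) {
                            keys.add(line.split("\t", 2)[0]);
                        }
                    }
                    fileBKeys = keys;
                }
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        String key = record.toString().split("\t", 2)[0];
        if (fileBKeys.contains(key)) {  // read-only after setup(), no locking needed
            context.write(record, NullWritable.get());
        }
    }
}

Note this still parses File B once per map task JVM on the host, which is
exactly the cost being discussed; the threads only reduce how many tasks
(and therefore loads) run in total.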
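And if you try the reduce-side route: feed both files through the shuffle
(e.g. via MultipleInputs), tag each record with its source, and have the
reducer keep a File A record only when File B contributed the same key. A
rough sketch with hypothetical class names, assuming tab-delimited records
keyed on the first field:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Wired to File A via MultipleInputs: emits (join key, tagged full record).
public class FileAMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        String key = record.toString().split("\t", 2)[0];
        context.write(new Text(key), new Text("A\t" + record));
    }
}

// Wired to File B via MultipleInputs: emits (join key, presence marker).
public class FileBMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        String key = record.toString().split("\t", 2)[0];
        context.write(new Text(key), new Text("B"));
    }
}

// Keeps a File A record only if File B produced the same key.
public class SemiJoinReducer extends Reducer<Text, Text, Text, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> aRecords = new ArrayList<>();
        boolean inFileB = false;
        for (Text v : values) {
            String s = v.toString();
            if (s.startsWith("B")) {
                inFileB = true;
            } else {
                aRecords.add(s.substring(2)); // drop the "A\t" tag
            }
        }
        if (inFileB) {
            for (String record : aRecords) {
                context.write(new Text(record), NullWritable.get());
            }
        }
    }
}

Buffering the A records per key is fine when join keys repeat rarely; if a
key can carry many A records, a secondary sort that makes the "B" marker
arrive first would let the reducer stream instead of buffer.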