hadoop-hdfs-user mailing list archives

From twinkle sachdeva <twinkle.sachd...@gmail.com>
Subject Re: MultithreadedMapper - Sharing Data Structure
Date Mon, 24 Aug 2015 08:46:24 GMT

We have been using the JVM reuse feature for the same reason: to share one
structure across multiple map tasks. A multithreaded map task achieves that
only partially, since the same copy is shared among its own threads.

Depending on the available hardware, one can get the same performance.
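
To illustrate the within-JVM sharing described above, here is a minimal
sketch in plain Java with no Hadoop dependencies. The class name SharedLookup,
the generated keys, and the thread pool standing in for the mapper threads are
all hypothetical; a real job would parse File B from the distributed cache
instead. The holder idiom lets the JVM build the structure once, and every
thread then reads the same unmodifiable copy.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper for illustration; not part of Hadoop.
final class SharedLookup {
    // Counts how many times the lookup structure was built in this JVM.
    static final AtomicInteger LOADS = new AtomicInteger();

    // Initialization-on-demand holder: the JVM runs this initializer once,
    // and class initialization safely publishes KEYS to all threads.
    private static final class Holder {
        static final Set<String> KEYS = load();
    }

    private static Set<String> load() {
        LOADS.incrementAndGet();
        Set<String> keys = new HashSet<>();
        // Stand-in for reading File B: every even-numbered key is present.
        for (int i = 0; i < 1000; i += 2) {
            keys.add("key-" + i);
        }
        return Collections.unmodifiableSet(keys); // read-only after loading
    }

    static boolean contains(String key) {
        return Holder.KEYS.contains(key);
    }

    // Runs `threads` workers that each look up `recordsPerThread` keys
    // against the single shared set and returns the total number of hits.
    static int countMatches(int threads, int recordsPerThread) {
        AtomicInteger matches = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < recordsPerThread; i++) {
                    if (contains("key-" + i)) {
                        matches.incrementAndGet();
                    }
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return matches.get();
    }

    public static void main(String[] args) {
        System.out.println("matches=" + countMatches(4, 1000)
                + " loads=" + LOADS.get());
    }
}
```

However many threads run, the structure is built once per JVM. The sharing
stops at the JVM boundary, so separate map task JVMs on the same host would
each pay the load cost, as Harsh points out in the quoted reply.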


On Mon, Aug 24, 2015 at 1:37 PM, Harsh J <harsh@cloudera.com> wrote:

> The MultithreadedMapper won't solve your problem, as all it does is run
> parallel maps within the same map task JVM as a non-MT one. Your data
> structure won't be shared across the different map task JVMs on the host,
> but just within the map task's own multiple threads running the map()
> function over input records.
> Wouldn't doing a reduce-side join for larger files be much faster?
> On Sun, Aug 23, 2015 at 5:08 AM Pedro Magalhaes <pedrorjbr@gmail.com>
> wrote:
>> I am developing a job that has 30B records in the input path. (File A)
>> I need to filter these records using another file that can have 30K to
>> 180M records. (File B)
>> So for each record in File A, I will do a lookup in File B.
>> I am using the distributed cache to share File B. The problem is that if
>> File B is too large (for example 180M records), I spend too much
>> time (CPU processing) loading it into a hashmap. I do this loading in
>> each map task.
>> In Hadoop 2.x, JVM reuse was discontinued. So I am thinking of using
>> MultithreadedMapper, making the hashmap thread-safe, and sharing this
>> read-only structure across the mappers.
>> Is this a good approach?
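
For reference, the reduce-side join that Harsh suggests above can be sketched
in plain Java as an in-memory simulation. This is only an illustration under
stated assumptions: the class and method names are hypothetical, each File A
record is assumed to be a key plus a payload, File B is assumed to be a list
of keys, and "filter" is assumed to mean keeping the File A records whose key
appears in File B. In a real job the framework's shuffle would do the grouping
that the TreeMap does here, so no map task has to hold File B in memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory simulation of a reduce-side join; not Hadoop code.
final class ReduceSideFilter {
    // A value tagged with the input it came from: 'A' or 'B'.
    static final class Tagged {
        final char source;
        final String payload;

        Tagged(char source, String payload) {
            this.source = source;
            this.payload = payload;
        }
    }

    // "Map" and "shuffle": emit (key, tagged value) for both inputs and
    // group the values by key, as the framework would before the reducers.
    static Map<String, List<Tagged>> shuffle(List<String[]> fileA, List<String> fileB) {
        Map<String, List<Tagged>> groups = new TreeMap<>();
        for (String[] rec : fileA) {
            groups.computeIfAbsent(rec[0], k -> new ArrayList<>())
                  .add(new Tagged('A', rec[1]));
        }
        for (String key : fileB) {
            groups.computeIfAbsent(key, k -> new ArrayList<>())
                  .add(new Tagged('B', ""));
        }
        return groups;
    }

    // "Reduce": for each key, keep the File A payloads only if the same
    // key also arrived from File B (assumed filter semantics).
    static List<String> reduce(Map<String, List<Tagged>> groups) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<Tagged>> e : groups.entrySet()) {
            boolean inB = false;
            for (Tagged t : e.getValue()) {
                if (t.source == 'B') {
                    inB = true;
                }
            }
            if (!inB) {
                continue;
            }
            for (Tagged t : e.getValue()) {
                if (t.source == 'A') {
                    out.add(e.getKey() + "\t" + t.payload);
                }
            }
        }
        return out;
    }
}
```

The trade-off is that all 30B records of File A pass through the shuffle,
so whether this beats the in-memory lookup depends on the cluster and on
the size of File B.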
