hadoop-hdfs-user mailing list archives

From Pedro Magalhaes <pedror...@gmail.com>
Subject MultithreadedMapper - Sharing Data Structure
Date Sat, 22 Aug 2015 23:38:27 GMT
I am developing a job that has 30B records in the input path (File A).
I need to filter these records using another file that can have 30K to 180M
records (File B).
So for each record in File A, I will do a lookup in File B.
I am using the distributed cache to share File B. The problem is that if
File B is too large (for example, 180M records), I spend too much
time (CPU processing) loading it into a HashMap. I repeat this loading in
each map task.

In Hadoop 2.x, JVM reuse was discontinued. So I am thinking of using
MultithreadedMapper, making the HashMap thread-safe, and sharing this
read-only structure across the mappers.
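
A minimal sketch of the sharing idea, in plain Java without the Hadoop classes. It assumes that the mapper threads started by MultithreadedMapper live in one task JVM, so a static field is visible to all of them. The names SharedLookup, load, get, and keep are hypothetical, and the in-memory list stands in for File B read from the distributed cache.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

class SharedLookup {
    // Counts how many times the expensive load actually ran.
    static final AtomicInteger LOADS = new AtomicInteger();

    // One copy per JVM; volatile so every thread sees the fully built set.
    private static volatile Set<String> keys;

    // Hypothetical loader: stands in for parsing File B from the distributed cache.
    static Set<String> load(List<String> fileBRecords) {
        LOADS.incrementAndGet();
        return Collections.unmodifiableSet(new HashSet<>(fileBRecords));
    }

    // Double-checked locking: only the first caller builds the set,
    // later callers reuse it without taking the lock.
    static Set<String> get(List<String> fileBRecords) {
        Set<String> local = keys;
        if (local == null) {
            synchronized (SharedLookup.class) {
                local = keys;
                if (local == null) {
                    local = load(fileBRecords);
                    keys = local;
                }
            }
        }
        return local;
    }

    // Filter step applied to each File A record key.
    static boolean keep(String recordKey, List<String> fileBRecords) {
        return get(fileBRecords).contains(recordKey);
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> fileB = Arrays.asList("k1", "k2", "k3");
        // Four threads play the role of the mapper threads in one task.
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> keep("k2", fileB));
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        System.out.println("loads = " + LOADS.get());
    }
}
```

Because the set is never modified after it is published, lookups need no locking; only the one-time initialization is synchronized. In a real job the get call would presumably sit in the mapper's setup method, and the whole structure still has to fit in the heap of a single task.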

Is this a good approach?
