From: Harsh J
Date: Mon, 24 Aug 2015 08:07:12 +0000
Subject: Re: MultithreadedMapper - Sharing Data Structure
To: user@hadoop.apache.org

The MultithreadedMapper won't solve your problem, as all it does is run
parallel maps within the same map task JVM as a non-multithreaded one. Your
data structure won't be shared across the different map task JVMs on the
host, only within a single map task's own threads running the map() function
over input records.

Wouldn't a reduce-side join be much faster for the larger files?

On Sun, Aug 23, 2015 at 5:08 AM Pedro Magalhaes <pedrorjbr@gmail.com> wrote:
> I am developing a job that has 30B records in the input path (File A).
> I need to filter these records using another file that can have 30K to
> 180M records (File B). So for each record in File A, I make a lookup in
> File B.
> I am using the distributed cache to share File B. The problem is that if
> File B is too large (for example, 180M records), I spend too much CPU
> time loading it into a HashMap, and this load happens in every map task.
>
> In Hadoop 2.x, JVM reuse was discontinued. So I am thinking of using
> MultithreadedMapper, making the HashMap thread-safe and sharing this
> read-only structure across the mappers.
>
> Is this a good approach?
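For concreteness, this is roughly how MultithreadedMapper gets wired into a
job, if you do go that way. A minimal sketch against the Hadoop 2.x
mapreduce API; FilterMapper (sketched further below), the job name, and the
path arguments are all hypothetical:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FilterJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "fileA-filter");
        job.setJarByClass(FilterJobDriver.class);

        // The framework runs MultithreadedMapper, which starts a pool of
        // threads inside the single task JVM; each thread drives its own
        // instance of the real mapper class over a share of the input.
        job.setMapperClass(MultithreadedMapper.class);
        MultithreadedMapper.setMapperClass(job, FilterMapper.class);
        MultithreadedMapper.setNumberOfThreads(job, 8);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // File A
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.addCacheFile(new URI(args[2] + "#fileB"));          // File B, symlinked as "fileB"

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}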
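The one thing the extra threads do buy you is amortizing the File B load
across the threads of a single task JVM, provided the structure lives in a
static field. A sketch of that mapper (hypothetical names; the double-checked
locking assumes the set is never mutated after setup()):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    // Each mapper thread gets its own FilterMapper instance, but statics
    // are per-JVM, so File B is parsed once per task rather than once per
    // thread. The volatile read/write pair publishes the fully built set.
    private static volatile Set<String> fileBKeys;

    @Override
    protected void setup(Context context) throws IOException {
        if (fileBKeys == null) {
            synchronized (FilterMapper.class) {
                if (fileBKeys == null) {
                    Set<String> keys = new HashSet<>();
                    // "fileB" is the symlink registered via addCacheFile(...#fileB)
                    try (BufferedReader reader =
                             new BufferedReader(new FileReader("fileB"))) {
                        String line;
                        while ((line = reader.readLine()) != null) {
                            keys.add(line.split("\t", 2)[0]);
                        }
                    }
                    fileBKeys = keys;
                }
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        String key = record.toString().split("\t", 2)[0];
        if (fileBKeys.contains(key)) {  // read-only after setup(), no locking needed
            context.write(record, NullWritable.get());
        }
    }
}

Note this still parses File B once per map task JVM on the host, which is
exactly the cost being discussed; the threads only reduce how many tasks
(and therefore loads) run in total.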
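And if you try the reduce-side route: feed both files through the shuffle
(e.g. via MultipleInputs), tag each record with its source, and have the
reducer keep a File A record only when File B contributed the same key. A
rough sketch with hypothetical class names, assuming tab-delimited records
keyed on the first field:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Wired to File A via MultipleInputs: emits (join key, tagged full record).
public class FileAMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        String key = record.toString().split("\t", 2)[0];
        context.write(new Text(key), new Text("A\t" + record));
    }
}

// Wired to File B via MultipleInputs: emits (join key, presence marker).
public class FileBMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        String key = record.toString().split("\t", 2)[0];
        context.write(new Text(key), new Text("B"));
    }
}

// Keeps a File A record only if File B produced the same key.
public class SemiJoinReducer extends Reducer<Text, Text, Text, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> aRecords = new ArrayList<>();
        boolean inFileB = false;
        for (Text v : values) {
            String s = v.toString();
            if (s.startsWith("B")) {
                inFileB = true;
            } else {
                aRecords.add(s.substring(2)); // drop the "A\t" tag
            }
        }
        if (inFileB) {
            for (String record : aRecords) {
                context.write(new Text(record), NullWritable.get());
            }
        }
    }
}

Buffering the A records per key is fine when join keys repeat rarely; if a
key can carry many A records, a secondary sort that makes the "B" marker
arrive first would let the reducer stream instead of buffer.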