From: Robert Evans <evans@yahoo-inc.com>
To: mapreduce-user@hadoop.apache.org
Date: Mon, 7 Nov 2011 08:26:09 -0800
Subject: Re: On the relationship between MultithreadedMapper and tasks

What you are doing is commonly called a map-side join. You can often improve your performance by increasing the replication count for the smaller files (10 is usually a good number, but you can play around with it), and by bringing them in through the distributed cache instead of reading them directly from HDFS.

To answer your original question: what I would suggest is to have a synchronized static method do the setup.

public class ... {
    private static boolean dataHasBeenRead = false;

    private static synchronized void internalSetup(...) {
        if (!dataHasBeenRead) {
            // Read the data
            ...
            dataHasBeenRead = true;
        }
    }
    ...
}

You can then call this from your original setup() method, or you can switch to something like Pig, which handles all of this for you behind the scenes.

--Bobby Evans

On 11/5/11 11:47 AM, "Claudio Martella" <claudio.martella@tis.bz.it> wrote:

Hello list,

I have a task where I have to compare the entries of a big sequence file with the entries of many small sequence files.
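A minimal runnable sketch of the synchronized one-time load Bobby describes above (the class name `SideDataLoader` and the load counter are illustrative, not from the thread): the synchronized static method guarantees the side data is read only once per JVM, no matter how many mapper threads call it from their setup().

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SideDataLoader {
    private static boolean dataHasBeenRead = false;
    private static int loadCount = 0; // counts actual loads, for illustration only

    // Bobby's pattern: entering this method takes the class-level lock,
    // so even if every thread calls it from setup(), the body that reads
    // the small files runs exactly once per JVM.
    public static synchronized void internalSetup() {
        if (!dataHasBeenRead) {
            // Here you would read the small sequence files into a
            // static shared data structure.
            loadCount++;
            dataHasBeenRead = true;
        }
    }

    public static int getLoadCount() {
        return loadCount;
    }

    public static void main(String[] args) throws Exception {
        // Simulate several mapper threads racing through setup().
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 8; i++) {
            pool.submit(SideDataLoader::internalSetup);
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("loads=" + getLoadCount());
    }
}
```

Running it prints loads=1 regardless of the thread count, which is exactly the once-per-JVM behavior the pattern is after.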
Basically you could describe it like this:

for entry in bigSequenceFile:
    for file in listOfSmallFiles:
        for entry2 in file:
            compare(entry, entry2)

My approach is to use a map-only job. I put the big file in HDFS in the directory that will be the input path for the job, and the small files in another HDFS directory. Then in my mapper I load all the small files into memory and compare them to the records that are passed to map() from the big file. Something like this:

map(key, value):
    for list in smallFiles:
        for entry in list:
            compare(value, entry)

As I don't want each node to load the files multiple times, wasting memory, I thought about using a MultithreadedMapper: load the small files in setup() (ideally called once, at the instantiation of the object) into a data structure (e.g. a LinkedList, smallFiles in the pseudocode) shared between the threads. So basically each node would create only one task with multiple threads, each running on one CPU and sharing the in-memory data structure.

The problem I'm facing is that the setup() method is called once for each thread (so #threads times in total), which is not what I want, as it will load the smallFiles multiple times into a shared data structure. I'm also afraid I'm not able to control the number of these MultithreadedMappers run on each node (I want only one).

Can anybody help me understand how I can leverage MultithreadedMapper to get what I want (there's very little info on the internet about MultithreadedMapper)?

Thanks!
CM

--
Claudio Martella
Free Software & Open Technologies Analyst

TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax  +39 0471 068 129
claudio.martella@tis.bz.it http://www.tis.bz.it

Short information regarding use of personal data. According to Section 13 of Italian Legislative Decree no. 196 of 30 June 2003, we inform you that we process your personal data in order to fulfil contractual and fiscal obligations and also to send you information regarding our services and events. Your personal data are processed with and without electronic means and by respecting data subjects' rights, fundamental freedoms and dignity, particularly with regard to confidentiality, personal identity and the right to personal data protection. At any time and without formalities you can write an e-mail to privacy@tis.bz.it in order to object to the processing of your personal data for the purpose of sending advertising materials and also to exercise the right to access personal data and other rights referred to in Section 7 of Decree 196/2003. The data controller is TIS Techno Innovation Alto Adige, Siemens Street n. 19, Bolzano. You can find the complete information on the web site www.tis.bz.it.
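For the archive: the MultithreadedMapper wiring the thread discusses could look roughly like the driver below, a sketch based on the Hadoop new (org.apache.hadoop.mapreduce) API of that era. `CompareDriver` and `CompareMapper` are hypothetical names; the paths are placeholders. Note that MultithreadedMapper creates a separate Mapper instance per thread and calls setup() on each, which is exactly why the synchronized static load is still needed inside the mapper.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

public class CompareDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Bobby's distributed-cache advice: ship the small files to each
        // node once instead of reading them repeatedly from HDFS.
        // (Raising their replication is done outside the job, e.g.
        // `hadoop fs -setrep 10 /small`.)
        DistributedCache.addCacheFile(new URI("/small/part-00000"), conf);

        Job job = new Job(conf, "map-side compare");
        job.setJarByClass(CompareDriver.class);

        // MultithreadedMapper is the mapper the framework runs; it wraps
        // the real map logic (CompareMapper, hypothetical) and executes
        // several instances of it inside one task JVM.
        job.setMapperClass(MultithreadedMapper.class);
        MultithreadedMapper.setMapperClass(job, CompareMapper.class);
        MultithreadedMapper.setNumberOfThreads(job, 8);

        job.setNumReduceTasks(0); // map-only job
        job.setInputFormatClass(SequenceFileInputFormat.class);
        SequenceFileInputFormat.addInputPath(job, new Path("/big"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

One caveat on the "one task per node" wish: the number of concurrent map tasks per node is a TaskTracker setting (mapred.tasktracker.map.tasks.maximum in classic MapReduce), not something a job can set for itself, so the per-JVM synchronized load is the reliable guard.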