From: Robert Evans <evans@yahoo-inc.com>
To: mapreduce-user@hadoop.apache.org
Date: Mon, 7 Nov 2011 08:26:09 -0800
Subject: Re: On the relationship between MultithreadedMapper and tasks

What you are doing is commonly called a map-side join. You can often improve your performance by increasing the replication count for the smaller files (10 is usually a good number, but you can play around with it), and by bringing them in through the distributed cache instead of reading them directly from HDFS.

To answer your original question: what I would suggest is to have a synchronized static method do the setup.

public class ... {
    private static boolean dataHasBeenRead = false;

    private static synchronized void internalSetup(...) {
        if (!dataHasBeenRead) {
            // Read the data
            ...
            dataHasBeenRead = true;
        }
    }
    ...
}

You can then call this from your original setup() method, or you can switch to something like Pig, which handles all of this for you behind the scenes.

--Bobby Evans

On 11/5/11 11:47 AM, "Claudio Martella" <claudio.martella@tis.bz.it> wrote:

Hello list,

I have a task where I have to compare the entries of a big sequence file with the entries of many small sequence files.
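A minimal runnable sketch of the synchronized one-time load Bobby describes above (the class name `SideDataLoader` and the load counter are illustrative, not from the thread): the synchronized static method guarantees the side data is read only once per JVM, no matter how many mapper threads call it from their setup().

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SideDataLoader {
    private static boolean dataHasBeenRead = false;
    private static int loadCount = 0; // counts actual loads, for illustration only

    // Bobby's pattern: entering this method takes the class-level lock,
    // so even if every thread calls it from setup(), the body that reads
    // the small files runs exactly once per JVM.
    public static synchronized void internalSetup() {
        if (!dataHasBeenRead) {
            // Here you would read the small sequence files into a
            // static shared data structure.
            loadCount++;
            dataHasBeenRead = true;
        }
    }

    public static int getLoadCount() {
        return loadCount;
    }

    public static void main(String[] args) throws Exception {
        // Simulate several mapper threads racing through setup().
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 8; i++) {
            pool.submit(SideDataLoader::internalSetup);
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("loads=" + getLoadCount());
    }
}
```

Running it prints loads=1 regardless of the thread count, which is exactly the once-per-JVM behavior the pattern is after.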
Basically you could describe it like this:

for entry in bigSequenceFile:
    for file in listOfSmallFiles:
        for entry2 in file:
            compare(entry, entry2)

My approach is to use a map-only job. I put the big file in HDFS in the directory that will be the input path for the job, and the small files in another HDFS directory. Then in my mapper I load all the small files into memory and compare them to the records that are passed to map() from the big file. Something like this:

map(key, value):
    for list in smallFiles:
        for entry in list:
            compare(value, entry)

As I don't want each node to load the files multiple times, wasting memory, I thought about using a MultithreadedMapper: load the small files in setup() (ideally called once, at the instantiation of the object) into a data structure (e.g. a LinkedList, smallFiles in the pseudocode) shared between the threads. So basically each node would create only one task with multiple threads, each running on one CPU and sharing the in-memory data structure.

The problem I'm facing is that the setup() method is called once for each thread (so #threads times in total), which is not what I want, as it will load the smallFiles multiple times into a shared data structure. I'm also afraid I'm not able to control the number of these MultithreadedMappers run on each node (I want only one).

Can anybody help me understand how I can leverage MultithreadedMapper to get what I want (there's very little info on the internet about MultithreadedMapper)?

Thanks!
CM

--
Claudio Martella
Free Software & Open Technologies Analyst

TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax  +39 0471 068 129
claudio.martella@tis.bz.it http://www.tis.bz.it

Short information regarding use of personal data. According to Section 13 of Italian Legislative Decree no. 196 of 30 June 2003, we inform you that we process your personal data in order to fulfil contractual and fiscal obligations and also to send you information regarding our services and events. Your personal data are processed with and without electronic means and by respecting data subjects' rights, fundamental freedoms and dignity, particularly with regard to confidentiality, personal identity and the right to personal data protection. At any time and without formalities you can write an e-mail to privacy@tis.bz.it in order to object to the processing of your personal data for the purpose of sending advertising materials and also to exercise the right to access personal data and other rights referred to in Section 7 of Decree 196/2003. The data controller is TIS Techno Innovation Alto Adige, Siemens Street n. 19, Bolzano. You can find the complete information on the web site www.tis.bz.it.
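For the archive: the MultithreadedMapper wiring the thread discusses could look roughly like the driver below, a sketch based on the Hadoop new (org.apache.hadoop.mapreduce) API of that era. `CompareDriver` and `CompareMapper` are hypothetical names; the paths are placeholders. Note that MultithreadedMapper creates a separate Mapper instance per thread and calls setup() on each, which is exactly why the synchronized static load is still needed inside the mapper.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

public class CompareDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Bobby's distributed-cache advice: ship the small files to each
        // node once instead of reading them repeatedly from HDFS.
        // (Raising their replication is done outside the job, e.g.
        // `hadoop fs -setrep 10 /small`.)
        DistributedCache.addCacheFile(new URI("/small/part-00000"), conf);

        Job job = new Job(conf, "map-side compare");
        job.setJarByClass(CompareDriver.class);

        // MultithreadedMapper is the mapper the framework runs; it wraps
        // the real map logic (CompareMapper, hypothetical) and executes
        // several instances of it inside one task JVM.
        job.setMapperClass(MultithreadedMapper.class);
        MultithreadedMapper.setMapperClass(job, CompareMapper.class);
        MultithreadedMapper.setNumberOfThreads(job, 8);

        job.setNumReduceTasks(0); // map-only job
        job.setInputFormatClass(SequenceFileInputFormat.class);
        SequenceFileInputFormat.addInputPath(job, new Path("/big"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

One caveat on the "one task per node" wish: the number of concurrent map tasks per node is a TaskTracker setting (mapred.tasktracker.map.tasks.maximum in classic MapReduce), not something a job can set for itself, so the per-JVM synchronized load is the reliable guard.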