From: Luca Pireddu
Organization: CRS4
To: common-user@hadoop.apache.org
Subject: Re: Memory mapped resources
Date: Wed, 13 Apr 2011 09:21:49 +0200

On April 12, 2011 21:50:07 Luke Lu wrote:
> You can use distributed cache for memory mapped files (they're local
> to the node the tasks run on.)
>
> http://developer.yahoo.com/hadoop/tutorial/module5.html#auxdata

We adopted this solution for a similar problem. For a program we developed, each mapper needed read-only access to an index about 4 GB in size. We distributed the index to each node with the distributed cache and then accessed it with mmap. The 4 GB are loaded into memory once, but shared by all the map tasks on the same node.

The mapper is written in C, so we can call mmap directly. In Java you may be able to get the same effect with java.nio.channels.FileChannel (a sketch follows below the quoted text).

Luca

> On Tue, Apr 12, 2011 at 10:40 AM, Benson Margulies wrote:
> > Here's the OP again.
> >
> > I want to make it clear that my question here has to do with the
> > problem of distributing 'the program' around the cluster, not 'the
> > data'. In the case at hand, the issue is a system that has a large
> > data resource that it needs to do its work. Every instance of the
> > code needs the entire model, not just some blocks or pieces.
> >
> > Memory mapping is a very attractive tactic for this kind of data
> > resource. The data is read-only. Memory-mapping it allows the
> > operating system to ensure that only one copy of the thing ends up in
> > physical memory.
> >
> > If we force the model into a conventional file (storable in HDFS) and
> > read it into the JVM in a conventional way, then we get as many copies
> > in memory as we have JVMs. On a big machine with a lot of cores, this
> > begins to add up.
> >
> > For people who are running a cluster of relatively conventional
> > systems, just putting copies on all the nodes in a conventional place
> > is adequate.
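
For reference, a minimal, untested sketch of the java.nio approach mentioned above. The file name "index.dat" is only illustrative; it stands for whatever local symlink the distributed cache creates for the index. Note that a single MappedByteBuffer is limited to 2 GB, so an index this size has to be mapped in slices:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.util.ArrayList;
    import java.util.List;

    public class MappedIndex {
        // 1 GB per mapping; a MappedByteBuffer cannot exceed 2 GB.
        private static final long SLICE_SIZE = 1L << 30;

        // Map the local index file read-only.  READ_ONLY mappings let the
        // kernel keep a single physical copy of the pages, shared by every
        // task JVM on the node.  The mappings stay valid after the channel
        // is closed.
        public static List<MappedByteBuffer> mapIndex(String localPath)
                throws IOException {
            List<MappedByteBuffer> slices = new ArrayList<MappedByteBuffer>();
            try (RandomAccessFile raf = new RandomAccessFile(localPath, "r");
                 FileChannel channel = raf.getChannel()) {
                long size = channel.size();
                for (long offset = 0; offset < size; offset += SLICE_SIZE) {
                    long length = Math.min(SLICE_SIZE, size - offset);
                    slices.add(channel.map(FileChannel.MapMode.READ_ONLY,
                                           offset, length));
                }
            }
            return slices;
        }
    }

Each mapper would call mapIndex("index.dat") once during setup and then do its lookups against the returned slices, without pulling the whole file into the Java heap.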
--
Luca Pireddu
CRS4 - Distributed Computing Group
Loc. Pixina Manna Edificio 1
Pula 09010 (CA), Italy
Tel: +39 0709250452