From: Luca Pireddu
Organization: CRS4
To: common-user@hadoop.apache.org
Subject: Re: Memory mapped resources
Date: Wed, 13 Apr 2011 09:21:49 +0200

On April 12, 2011 21:50:07 Luke Lu wrote:
> You can use distributed cache for memory mapped files (they're local
> to the node the tasks run on.)
>
> http://developer.yahoo.com/hadoop/tutorial/module5.html#auxdata

We adopted this solution for a similar problem. For a program we developed, each mapper needed read-only access to an index about 4 GB in size. We distributed the index to each node with the distributed cache and then accessed it with mmap. The 4 GB are loaded into memory once, but shared by all the map tasks on the same node.

The mapper is written in C, so we can call mmap directly. In Java you may be able to get the same effect with java.nio.channels.FileChannel (a sketch follows below the quoted text).

Luca

> On Tue, Apr 12, 2011 at 10:40 AM, Benson Margulies wrote:
> > Here's the OP again.
> >
> > I want to make it clear that my question here has to do with the
> > problem of distributing 'the program' around the cluster, not 'the
> > data'. In the case at hand, the issue is a system that has a large
> > data resource that it needs to do its work. Every instance of the
> > code needs the entire model, not just some blocks or pieces.
> >
> > Memory mapping is a very attractive tactic for this kind of data
> > resource. The data is read-only. Memory-mapping it allows the
> > operating system to ensure that only one copy of the thing ends up in
> > physical memory.
> >
> > If we force the model into a conventional file (storable in HDFS) and
> > read it into the JVM in a conventional way, then we get as many copies
> > in memory as we have JVMs. On a big machine with a lot of cores, this
> > begins to add up.
> >
> > For people who are running a cluster of relatively conventional
> > systems, just putting copies on all the nodes in a conventional place
> > is adequate.
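
For reference, a minimal, untested sketch of the java.nio approach mentioned above. The file name "index.dat" is only illustrative; it stands for whatever local symlink the distributed cache creates for the index. Note that a single MappedByteBuffer is limited to 2 GB, so an index this size has to be mapped in slices:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.util.ArrayList;
    import java.util.List;

    public class MappedIndex {
        // 1 GB per mapping; a MappedByteBuffer cannot exceed 2 GB.
        private static final long SLICE_SIZE = 1L << 30;

        // Map the local index file read-only.  READ_ONLY mappings let the
        // kernel keep a single physical copy of the pages, shared by every
        // task JVM on the node.  The mappings stay valid after the channel
        // is closed.
        public static List<MappedByteBuffer> mapIndex(String localPath)
                throws IOException {
            List<MappedByteBuffer> slices = new ArrayList<MappedByteBuffer>();
            try (RandomAccessFile raf = new RandomAccessFile(localPath, "r");
                 FileChannel channel = raf.getChannel()) {
                long size = channel.size();
                for (long offset = 0; offset < size; offset += SLICE_SIZE) {
                    long length = Math.min(SLICE_SIZE, size - offset);
                    slices.add(channel.map(FileChannel.MapMode.READ_ONLY,
                                           offset, length));
                }
            }
            return slices;
        }
    }

Each mapper would call mapIndex("index.dat") once during setup and then do its lookups against the returned slices, without pulling the whole file into the Java heap.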
--
Luca Pireddu
CRS4 - Distributed Computing Group
Loc. Pixina Manna Edificio 1
Pula 09010 (CA), Italy
Tel: +39 0709250452