From: Vinod Kumar Vavilapalli
To: user@hadoop.apache.org
Subject: Re: Distribution of native executables and data for YARN-based execution
Date: Thu, 16 May 2013 23:55:41 -0700

The "local resources" you mentioned are exactly the solution for this. For each LocalResource, you also specify a LocalResourceVisibility, which today takes one of three values: PUBLIC, PRIVATE and APPLICATION.

PUBLIC resources are downloaded only once and shared by any application running on that node.

PRIVATE resources are downloaded only once and shared by any application run by the same user on that node.

APPLICATION resources are downloaded per application and removed after the application finishes.

Seems like you want PUBLIC or PRIVATE.

Note that for PUBLIC resources to work, the corresponding files need to be public on HDFS too.

Also, if the remote files on HDFS are updated, these local files will be downloaded afresh on each node where your containers run.
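
To make the sharing rules above concrete, here is a small toy model in plain Java. It is only a sketch and does not use the YARN API: the class LocalizationModel, its cache-key layout and the HDFS path are all hypothetical, and the assumption that a node-local cache is keyed by scope, path and HDFS modification time is a simplification for illustration, not a description of the NodeManager's actual code.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of one node's resource cache. Names and key layout are
// illustrative assumptions, not YARN's actual implementation.
class LocalizationModel {
    enum Visibility { PUBLIC, PRIVATE, APPLICATION }

    private final Set<String> cache = new HashSet<>();
    private int downloads = 0;

    // The scope prefix decides who may share a cached copy.
    static String scopeOf(Visibility v, String user, String appId) {
        switch (v) {
            case PUBLIC:  return "public";
            case PRIVATE: return "user:" + user;
            default:      return "app:" + appId;
        }
    }

    // Returns true if the resource had to be downloaded, false on a cache hit.
    // The HDFS modification time is part of the key, so an updated remote
    // file no longer matches the cached copy and is fetched again.
    boolean localize(Visibility v, String hdfsPath, long modTime,
                     String user, String appId) {
        String key = scopeOf(v, user, appId) + "|" + hdfsPath + "|" + modTime;
        if (cache.add(key)) {
            downloads++;
            return true;
        }
        return false;
    }

    // APPLICATION-scoped copies are dropped when their application finishes.
    void applicationFinished(String appId) {
        String prefix = "app:" + appId + "|";
        cache.removeIf(k -> k.startsWith(prefix));
    }

    int downloads() { return downloads; }

    public static void main(String[] args) {
        LocalizationModel node = new LocalizationModel();
        String data = "/apps/native/refdata.bin"; // hypothetical HDFS path

        node.localize(Visibility.PUBLIC, data, 1000L, "john", "app_1");
        node.localize(Visibility.PUBLIC, data, 1000L, "mary", "app_2");
        node.localize(Visibility.PUBLIC, data, 2000L, "john", "app_3");
        System.out.println("downloads: " + node.downloads());
    }
}
```

In main, the second request is a cache hit even though it comes from a different user and application, and the third request, which carries a newer modification time, triggers a second download. With a large data set that rarely changes, that is the behaviour you would be relying on with PUBLIC or PRIVATE.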

HTH

Thanks,
+Vinod Kumar Vavilapalli
Hortonworks Inc.
http://hortonworks.com/


On May 16, 2013, at 2:21 PM, John Lilley wrote:

> I am attempting to distribute the execution of a C-based program onto a Hadoop cluster, without using MapReduce. I read that YARN can be used to schedule non-MapReduce applications by programming to the ASM/RM interfaces. As I understand it, eventually I get down to specifying each sub-task via ContainerLaunchContext.setCommands().
>
> However, the program and shared libraries need to be stored on each worker's local disk to run. In addition there is a hefty data set that the application uses (say, 4GB) that is accessed via regular open()/read() calls by a library. I thought a decent strategy would be to push the program+data package to a known folder in HDFS, then launch a "bootstrap" that compared the HDFS folder version to a local folder, copying any updated files as needed before launching the native application task.
>
> Are there better approaches? I notice that one can implicitly copy "local resources" as part of the launch, but I don't want to copy 4GB every time, only occasionally when the application or reference data is updated. Also, will my bootstrapper be allowed to set executable-mode bits on the programs after they are copied?
>
> Thanks
> John
