hadoop-mapreduce-user mailing list archives

From John Lilley <john.lil...@redpoint.net>
Subject Distribution of native executables and data for YARN-based execution
Date Thu, 16 May 2013 21:21:49 GMT
I am attempting to distribute the execution of a C-based program onto a Hadoop cluster, without
using MapReduce.  I read that YARN can be used to schedule non-MapReduce applications by programming
to the ASM/RM interfaces.  As I understand it, eventually I get down to specifying each sub-task
via ContainerLaunchContext.setCommands().

However, the program and its shared libraries must be stored on each worker's local disk to
run.  In addition, there is a hefty data set that the application uses (say, 4GB), which a library
accesses via regular open()/read() calls.  I thought a decent strategy would be to push
the program+data package to a known folder in HDFS, then launch a "bootstrap" that compares
the HDFS folder version to a local folder, copying any updated files as needed before launching
the native application task.
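
To make the bootstrap idea concrete, here is a minimal sketch of the compare-and-copy step.
It is only an illustration under stated assumptions: both folders are ordinary paths handled
with java.nio (the "master" folder stands in for the HDFS folder, which a real bootstrap
would read through the Hadoop FileSystem API instead), a file counts as updated when it is
missing locally, differs in size, or is newer than the local copy, and the names
BootstrapSync, sync and isStale are hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: mirrors regular files from a master folder into a local cache.
final class BootstrapSync {

    // Copies every regular file that is missing or out of date in localDir.
    // Returns the names of the files that were actually copied.
    static List<String> sync(Path masterDir, Path localDir) throws IOException {
        Files.createDirectories(localDir);
        List<String> copied = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(masterDir)) {
            for (Path src : entries) {
                if (!Files.isRegularFile(src)) {
                    continue; // sketch only: subfolders are not handled
                }
                String name = src.getFileName().toString();
                Path dst = localDir.resolve(name);
                if (isStale(src, dst)) {
                    // COPY_ATTRIBUTES carries the modification time over, so an
                    // unchanged file is not copied again on the next run.
                    Files.copy(src, dst,
                            StandardCopyOption.REPLACE_EXISTING,
                            StandardCopyOption.COPY_ATTRIBUTES);
                    copied.add(name);
                }
            }
        }
        return copied;
    }

    // Assumed staleness rule: missing, different size, or older than the master copy.
    // Times are compared in milliseconds because finer precision is not always preserved.
    static boolean isStale(Path src, Path dst) throws IOException {
        if (!Files.exists(dst)) {
            return true;
        }
        if (Files.size(src) != Files.size(dst)) {
            return true;
        }
        return Files.getLastModifiedTime(src).toMillis()
                > Files.getLastModifiedTime(dst).toMillis();
    }
}
```

With this rule the 4GB data set would be transferred only on the first run on a given worker
and again after it changes; every other launch pays only for the directory listing and the
per-file comparisons.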

Are there better approaches?  I notice that one can implicitly copy "local resources" as part
of the launch, but I don't want to copy 4GB every time, only occasionally when the application
or reference data is updated.  Also, will my bootstrapper be allowed to set executable-mode
bits on the programs after they are copied?
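
On the executable-bit question, the following sketch shows what the bootstrapper would need
to do after copying, assuming it is written in Java, runs on a POSIX file system, and owns the
files it has just written (ordinary POSIX rules let a file's owner change its mode). Whether
a given cluster configuration adds further restrictions is not something this sketch can
confirm. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermission;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: adds execute permission to a file the current user owns.
final class MakeExecutable {

    // Keeps the existing permission bits and adds execute for owner and group.
    static void addExecuteBits(Path file) throws IOException {
        Set<PosixFilePermission> perms =
                new HashSet<>(Files.getPosixFilePermissions(file));
        perms.add(PosixFilePermission.OWNER_EXECUTE);
        perms.add(PosixFilePermission.GROUP_EXECUTE);
        Files.setPosixFilePermissions(file, perms);
    }
}
```

The bootstrapper would call addExecuteBits on each copied program before launching the
native task; data files and shared libraries do not need the bit.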

